Designing Web Scrapers with Scrapy

A web scraper allows you to programmatically collect data from web pages that lack an official API. In the following tutorial, we will step through the installation and customization of a fully expandable Python web scraper. As complexity grows, you’ll want to equip your scraper to handle concurrency, which allows it to crawl multiple pages at once. Scrapy differs from Beautiful Soup, which only parses HTML, in that it is a full-blown web scraping framework.

1. Install Scrapy

To install the Scrapy Python module, you’ll first need pip, the package installer that pulls published Python software from PyPI, the Python Package Index. My previous tutorial, which focused on Beautiful Soup, goes into detail about the full Python, pip, and virtual environment installation steps. Assuming these are installed on your system, let’s dive into installing the Scrapy module by running the following command within your selected virtual environment:

$ pip install scrapy
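
To confirm that the installation succeeded, you can print the installed version (the exact version number will vary by release):

$ scrapy version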


Next, let’s create a new folder for your scraper called scrapy-scraper and then enter it:

$ mkdir scrapy-scraper
$ cd scrapy-scraper


Create a new Python file called scraper.py—in this example, we’ll use nano. This file will contain all of the code for our scraper application.

$ nano scraper.py


Create a Python class that subclasses scrapy.Spider, a class provided by the Scrapy library. In scraper.py, import the scrapy package and define a subclass NewSpider. This subclass builds on and further customizes the Spider class, which supplies the methods and behaviors that define how to follow URLs and extract specific data. name contains a user-generated name for your scraper bot. start_urls contains the list of URLs that the scraper will begin crawling from.

import scrapy

class NewSpider(scrapy.Spider):
    name = "new_spider"
    start_urls = ["https://www.website.com"]


Run the spider by entering the scrapy runspider command followed by the file path:

$ scrapy runspider scraper.py


When the file runs, the scraper reads from your targeted URL(s) and grabs the HTML. The HTML is then passed to a parse method, which we will need to define so it cherry-picks exactly the data we need.
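
For example, a minimal parse method might simply log each page’s title to confirm the HTML arrived intact; this placeholder is purely illustrative and will be replaced in the next section:

def parse(self, response):
    # Sanity check: log the URL and <title> of each page the spider receives.
    title = response.css("title::text").extract_first()
    self.log(f"Fetched {response.url} with title: {title}")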

2. Data Extraction from a URL

Let’s begin by reviewing the source HTML of the targeted URL, looking for patterns and taking note of the table structure and of specific table, div, class, or id names relating to the data you’d like pulled. The patterns used to match these items are called selectors. Scrapy can pull data using either CSS or XPath (XML Path Language) selectors. Notably, CSS selectors are converted to XPath under the hood.
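
For instance, the following two queries match the same hypothetical elements; the CSS form is usually more compact (the div.date selector here is an illustrative example, not from any particular site):

response.css("div.date")                 # CSS selector
response.xpath("//div[@class='date']")   # roughly equivalent XPath query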


You can cherry-pick data matching a given selector by passing that selector to the response object’s .css() method. The basic structure for iterating over every element that matches a specific selector appears below:


def parse(self, response):
    SET_SELECTOR = ".date"
    for new_spider in response.css(SET_SELECTOR):
        pass


To pull data only from elements nested inside another matched element, use the following structure:

def parse(self, response):
    SET_SELECTOR = ".date"
    for new_spider in response.css(SET_SELECTOR):
        NAME_SELECTOR = "h1 a ::text"
        yield {
            "name": new_spider.css(NAME_SELECTOR).extract_first(),
        }


In the code snippet directly above, note the use of the CSS pseudo-selector ::text. This fetches the text inside the a tag rather than the entire tag itself. Also note the use of extract_first(), which returns only the first matching result (or None if nothing matches) instead of a list of every match. Using the most relevant format from above, grab a useful bit of data by modifying this code structure to fit your needs from within your .py file.
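
The distinction matters whenever a selector matches more than one element. A quick sketch, assuming a page with several h1 a links:

links = new_spider.css("h1 a ::text")
links.extract_first()   # first match as a string, or None if nothing matches
links.extract()         # every match, returned as a list of strings

In newer Scrapy releases, .get() and .getall() are the preferred aliases for extract_first() and extract().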


Save and exit your .py file, and run your updated scraper with the following command:

$ scrapy runspider scraper.py

3. Scraping Several URLs

Your scraper is able to follow links it finds on a given web page. To tell your scraper to follow the link to the next page, for example, add the following snippet to the bottom of your parse method:

NEXT_PAGE_SELECTOR = ".next a ::attr(href)"
next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
if next_page:
    yield scrapy.Request(
        response.urljoin(next_page),
        callback=self.parse
    )

4. Running Your Scraper

In the above code, you’ve defined the selector for the next-page link and checked whether that link exists on the current page. If it does, scrapy.Request fetches the next page, response.urljoin() resolves the relative href against the current page’s URL, and callback=self.parse hands the new HTML back to the same parse method, so the scraping process repeats until no next link remains. The assembled scraper appears below; run it to watch the spider iterate through your desired data paths.
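
Here is the assembled scraper.py, putting every piece from this tutorial together. The .date, h1 a, and .next selectors are the placeholders used in the examples above, so swap in the selectors you identified on your own target site:

import scrapy

class NewSpider(scrapy.Spider):
    name = "new_spider"
    start_urls = ["https://www.website.com"]

    def parse(self, response):
        # Yield one item per matched element on the page.
        SET_SELECTOR = ".date"
        for new_spider in response.css(SET_SELECTOR):
            NAME_SELECTOR = "h1 a ::text"
            yield {
                "name": new_spider.css(NAME_SELECTOR).extract_first(),
            }

        # Follow the "next page" link, if one exists, and parse it too.
        NEXT_PAGE_SELECTOR = ".next a ::attr(href)"
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )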

$ scrapy runspider scraper.py
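
Finally, recall from the introduction that a growing scraper needs to handle concurrency. Scrapy already issues requests concurrently out of the box; if you need to tune that behavior, per-spider settings such as CONCURRENT_REQUESTS and DOWNLOAD_DELAY can be set via the custom_settings class attribute. The values below are illustrative, not recommendations:

class NewSpider(scrapy.Spider):
    name = "new_spider"
    start_urls = ["https://www.website.com"]
    custom_settings = {
        "CONCURRENT_REQUESTS": 16,  # maximum simultaneous requests (Scrapy's default)
        "DOWNLOAD_DELAY": 0.5,      # pause, in seconds, between requests to the same site
    }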


Review the full Scrapy documentation at https://docs.scrapy.org to discover additional ideas for use cases.




Posted by Lindsey
September 11, 2018