A machine learning project may build on data gathered from publicly available websites. Python is a useful language within the data science community, and a popular language for building powerful data mining tools that can enhance your data-driven projects. Beautiful Soup is a web scraping Python library that is compatible with both Python 2.7 and Python 3. The Beautiful Soup library (in development since 2004) generates a parse tree from parsed HTML and XML documents, saving you many programming hours.
Sitting on top of popular Python parsers like lxml and html5lib, Beautiful Soup allows you to try various parsing strategies, giving you added flexibility. Valuable data that was locked up tightly on poorly coded websites can now be plucked with minimal coding and pulled into a CSV file. CSV stands for comma-separated values, a useful format for data storage.
Prerequisites
For the purposes of this tutorial, we’ll be using the python3-bs4 package in a server-based Ubuntu 16.04 environment. The python-bs4 package is geared for Python 2, and the python-beautifulsoup4 package is available for Fedora. All documentation is made available here.
We’ll also need to install the Requests module. This module will integrate your Python program with web services.
In the next steps, I’ll take you through the installation of pip— a tool for managing packages in Python.
1. Set Up Python 3 and pip
I recommend first updating your system using apt-get, from your server’s command line.
$ sudo apt-get update
$ sudo apt-get -y upgrade
The -y flag automatically agrees to the conditions for all items to be installed and will help bypass prompts to speed up the update process.
Now, let’s check which version of Python is installed, and install pip—our Python package manager:
$ python3 -V
$ sudo apt-get install -y python3-pip
NumPy is a popular Python package used for scientific computing.
Let’s install it:
$ pip3 install numpy
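If you’d like a quick sanity check that NumPy installed correctly, you can open the Python 3 interpreter and try a small calculation (the numbers here are arbitrary):
$ python3
>>> import numpy as np
>>> np.mean([1, 2, 3, 4])
2.5
>>> quit()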
We’ll want a robust data science environment, so let’s continue with installing a few more tools:
$ sudo apt-get install build-essential libssl-dev libffi-dev python3-dev
2. Set Up Your Virtual Environment
A virtual environment will provide a protected space on your server for Python projects—this helps you maintain per-project dependencies (and versioning) that won’t conflict across your projects. Each new environment will appear as a directory or folder on your server.
First, let’s install the venv module. This will allow us to create virtual environments by simply entering pyvenv at the command line.
$ sudo apt-get install -y python3-venv
Now you’re ready to create your first virtual environment. Choose a directory for housing the Python programming environments, or create a brand new directory using the mkdir command and entering into the new folder using the cd (change directory) command:
$ mkdir py_environments
$ cd py_environments
$ pyvenv env_one
Let’s view the items contained in your newly created Python virtual environment by entering the following command:
$ ls env_one
The output should read:
bin include lib lib64 pyvenv.cfg share
This file structure keeps your project isolated, so its dependencies won’t mix with those of other projects. In order to use your new environment, it must be enabled by calling the activate script:
$ source env_one/bin/activate
Your prompt prefix will update to appear as (env_one) when working in the activated environment. Within the virtual environment, you may use either the python and pip commands or the python3 and pip3 commands to the same effect. Outside of this environment, you’ll need to use the python3 and pip3 commands exclusively.
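For example, inside the activated environment both of the following commands point to the same Python 3 interpreter:
(env_one) user@user:~/environments$ python -V
(env_one) user@user:~/environments$ python3 -V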
3. Install Modules
Now, in order to work with data found on websites, you’ll need to request the web page. The Requests library will serve up the pages in a human-readable way for Python programs.
Let’s install Requests using pip from within your freshly created virtual environment:
(env_one) user@user:~/environments$ pip install requests
Once the Requests module is installed, let’s move forward to installing our main data-scraping library — Beautiful Soup 4.
(env_one) user@user:~/environments$ pip install beautifulsoup4
4. Understanding the Python Interactive Console
Let’s step through scraping a basic web page. To do this, let’s switch over to the Python Interactive Console using the following command from within your virtual environment:
(env_one) user@user:~/environments$ python
The Python interpreter, or Python shell, allows you to execute commands and test out code without creating a file. Additionally, you can target different versions of Python from within the Python shell if your project requires it.
The Python shell accepts Python syntax following the >>> prompt. You can quickly assign values to variables and perform math with operators for quick calculations. When writing Python code that extends to multiple lines, the interpreter will switch to displaying an ellipsis prompt (...).
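For example, a quick session in the shell might look like this (the values here are arbitrary):
>>> width = 12
>>> height = 4
>>> width * height
48
>>> for i in range(2):
...     print(i)
...
0
1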
Installing a data science module such as matplotlib is an identical process to what we covered in step 3, using pip from your virtual environment’s command line:
(env_one) user@user:~/environments$ pip install matplotlib
Exiting the Python Interactive Console is easy, with the CTRL+D shortcut or quit() function. Both methods return you to the original terminal environment. For the next step, we’ll be using the Python Shell inside your virtual environment—let’s not exit quite yet.
5. Collect a Web Page
From within the Python Shell, import the Requests module.
>>> import requests
Next, let’s assign the URL of the webpage to the python variable url using the below command format:
>>> url = 'https://www.website.org'
>>>
Now, you can assign the result of this web page request to the variable page by using the requests.get() method. Let’s pass the URL we assigned to the url variable to this method.
>>> page = requests.get(url)
>>>
The variable page now holds a Response object. The object’s status_code property tells you whether the request succeeded: a return of code 200 simply means the page downloaded as desired, while codes beginning with a 4 or 5 indicate a response error. A full list of response codes is available here.
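For example, you can print the property directly; a successful request returns 200:
>>> page.status_code
200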
For data mining, we’ll want to retrieve the text-based content of a web page. The page.text property returns the response as text, while page.content returns it as bytes. Enter the following command to retrieve the full, un-parsed text of the web page:
>>> page.text
This format of output still isn’t very friendly to read, so let’s work with the text data to fashion it into a more human-readable format.
6. Parse HTML Text
As our first step toward making the web page HTML more friendly to work with, import Beautiful Soup into the Python Shell by entering the following command:
>>> from bs4 import BeautifulSoup
>>>
Next, let’s run the page.text document through the Beautiful Soup module—this will generate a parsed BeautifulSoup object we’ll work with moving forward.
>>> soup = BeautifulSoup(page.text, 'html.parser')
>>>
Now, let’s view the contents of the parsed object in your Python Shell using the prettify() print method. This will transform the Beautiful Soup parse tree into a formatted Unicode string and place each HTML tag on its own line in addition to properly nesting each tag.
>>> print(soup.prettify())
7. HTML Tag Extraction
Beautiful Soup comes equipped with a find_all method that will return all instances of a given tag.
>>> soup.find_all('p')
Running the find_all method will return the content of the <p> tags, in addition to any nested tags within a <p> tag. You’ll also notice that any <br/> line break tags found within are included.
Next, let’s call a particular item within an indexed (bracketed) Python list generated from the HTML tag output. The get_text() method can then be used to extract all of the text inside of an indexed tag. Index numbers correspond to items within a list; you can lean on these indexes to access each item of a list discretely and work with all of the selected items. The code below will output the data contained in the third <p> tag element. Any \n line breaks will also be returned:
>>> soup.find_all('p')[2].get_text()
Depending on the structure of your indexed list, you can manipulate and select the data to output using a variety of methods specified below.
Concatenate
To concatenate string items pulled from a list, you can use the + operator to join two or more of them into a single string.
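As a sketch, assuming the page contains at least two <p> tags, you could join their text like this:
>>> soup.find_all('p')[0].get_text() + ' ' + soup.find_all('p')[1].get_text()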
Slice
To slice a list into a smaller range of values, first specify a range of indexed list items to work with using the format [x:y]. The first number is the index indicating where the slice starts (including itself), and the second number is the index indicating where the slice should end (excluding itself). For instance, slice [1:4] will contain 3 items. Alternatively, if you’d like to print the first 3 items of your list you can enter [:3].
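For example, using the list of <p> tags from the earlier find_all call:
>>> tags = soup.find_all('p')
>>> tags[1:4]   # items at index 1, 2, and 3
>>> tags[:3]    # the first three items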
Stride
Stride tells your program how many items to move forward after the first item is retrieved from the list. Python defaults to a stride of 1, so that each integer between two index numbers is retrieved. The Python syntax for specifying stride looks like the following: list[x:y:z], with z indicating the stride. You can enter [::3] to apply a stride throughout a list of any size, printing every third item, for example.
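Continuing with the tags list from the slicing example above:
>>> tags[::3]   # every third item, starting at index 0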
+ Operator
When working with a table of data, you may wish to concatenate two lists of items together, for example:
list = ['item one'] + ['item two']
print(list)
Compound Operators
You can use the += and *= compound operators to populate lists with placeholder values that can be filled in later, for example with user-provided input:
list = ['item']
for x in range(1, 10):
    list += ['item']
print(list)
This for loop adds an extra item to the original list on each iteration.
list = ['item']
for x in range(1, 10):
    list *= 2
print(list)
This for loop doubles the length of the list on each iteration, assigning the new list back to the original variable.
Del Statement
To remove an item from a list, use the del statement to delete the value at the index number within your list. For example:
del list[2]
print(list)
8. HTML Tag Class and ID
If you’d like to pull data from an HTML element by its CSS class or ID, there are quick and easy methods for streamlining this extraction. To target a specific class or ID, use the find_all() method and pass the class or ID string as an argument, for example:
>>> soup.find_all(class_='headline')
The above command passes the class name to a keyword argument called class_ (the trailing underscore avoids a clash with Python’s reserved keyword class) and returns every element assigned that class.
To further narrow down your search, you can target a class or ID found within only certain types of tags, for example:
>>> soup.find_all('p', class_='headline')
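IDs work the same way through the id keyword argument. As a sketch, assuming the page contains an element with a hypothetical ID of top-story:
>>> soup.find_all(id='top-story')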
9. Creating a Python Program File
Up to this point, we’ve been running our code in the Python Shell without creating a new file in our virtual environment. Now that you’ve become familiar with the tools and commands for this tutorial, let’s create a new file for our data mining program using nano:
(env_one) user@user:~/environments$ nano new_file.py
Within your newly opened file, first import the modules we’ll need for our data mining program — Requests and Beautiful Soup.
# Import libraries
import requests
from bs4 import BeautifulSoup
Just like we did in the Python Interactive Console earlier, collect a desired web page with the Requests module and assign the result to the variable page. Add the following code to your new_file.py file.
# Collect web page with Requests module
page = requests.get('https://yourwebsitehere.com')
Next, create your BeautifulSoup object and parse using Python’s built-in html.parser.
# Create BeautifulSoup object and parse
soup = BeautifulSoup(page.text, 'html.parser')
Now let’s take another look at an example of extracting the data using methods we’ve touched on in step 6.
# Pull all text from BodyText div
list = soup.find(class_='BodyText')
# Pull text from all instances of 'p' tag within BodyText div
list_items = list.find_all('p')
And what if you’d like to create a for loop in order to iterate over all of a particular variable?
# Create for loop to print out item names
for list_item in list_items:
    print(list_item.prettify())
Now, save and exit the nano file and let’s run our newly created program:
(env_one) user@user:~/environments$ python new_file.py
Once you run your program, you’ll notice there’s a lot of additional text and tag information related to your targeted find_all request. Let’s start stripping out the non-essential information.
10. Stripping Non-Essential Text
If the information you’d like removed is contained in an HTML table, you can use the Beautiful Soup module to find the table’s class, assign it to a variable, and apply the decompose() method to remove that tag from your parse tree. This action deletes the tag along with any nested contents. Return to your Python program using nano and add the following lines:
# Remove extra links
delete_links = soup.find(class_='table class')
delete_links.decompose()
Run your program once more to confirm the targeted info has been removed from the program’s output.
11. Pulling Data from a Tag
Instead of printing out the full contents of an entire tag, you may find yourself wishing to pull a single piece of data. Let’s employ Beautiful Soup’s .contents attribute, which returns a tag’s children as an indexed Python list. Begin by revising the for loop to print the first child of each tag:
# Employ .contents to output data as indexed list
for list_item in list_items:
    names = list_item.contents[0]
    print(names)
Run your program once more to confirm the targeted info is being output as a human-readable, indexed list.
12. Pulling a URL
If you’d like to capture a URL within a table, follow the steps below to extract it using Beautiful Soup’s get('href') method:
# Extract a URL
for list_item in list_items:
    names = list_item.contents[0]
    # Assumes each list item contains a nested <a> tag holding the link
    links = 'https://www.website.org' + list_item.find('a').get('href')
    print(names)
    print(links)
13. Writing Data to a CSV File
CSV or comma-separated value files are a common document type used for storing tabular data in plain text. CSVs work well as a format for spreadsheets or databases.
Let’s first import Python’s built-in csv module by including the following at the top of your Python program file:
import csv
Next, create and open a file named list_names.csv by adding the following lines to your Python program:
file = csv.writer(open('list_names.csv', 'w'))
file.writerow(['Name', 'Link'])
In the above code, we’ve defined the top row headings, which we pass to the writerow() method as a list. When you run this program using the python command, you’ll notice that no output appears in your Python Shell window; instead, a file with the name you specified (list_names.csv) is generated in the directory you are working in.
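The heading row alone isn’t very useful, so you’ll also want to write the scraped data. As a sketch building on the loop from step 12 (and still assuming each list item contains a nested link), you could write each name and link as a row:
for list_item in list_items:
    names = list_item.contents[0]
    links = 'https://www.website.org' + list_item.find('a').get('href')
    file.writerow([names, links])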
Open up your CSV file in a spreadsheet editor in order to work with the data in meaningful ways.
14. Iterating through HTML Pages
If you’d like to pull data from several related web pages, you can utilize for loops. Begin by generating a Python data type list to house the targeted web pages:
pages = [ ]
Next, you’ll want to populate the list with a for loop. In the example below, we’ll target 4 pages:
for i in range(1, 5):
    url = 'https://www.website.org/page.' + str(i) + '.html'
    pages.append(url)
You’ll want to adapt the above URL and string structure to match the structure of your own related web pages. This for loop, which iterates through the 4 web pages, will then contain your second for loop that mines tag data.
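As a minimal sketch, assuming each page shares the hypothetical BodyText structure from the earlier steps, the nested loops might fit together like this:
for item in pages:
    page = requests.get(item)
    soup = BeautifulSoup(page.text, 'html.parser')
    list = soup.find(class_='BodyText')
    list_items = list.find_all('p')
    for list_item in list_items:
        names = list_item.contents[0]
        print(names)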