Web Scraping with Python for Data Analysis

Getting Started

You can start scraping in only five lines of code:

import requests
from bs4 import BeautifulSoup
res = requests.get("https://en.wikipedia.org/wiki/Python_(programming_language)")
bs = BeautifulSoup(res.text, 'lxml')
print(bs.find("p", class_=None).text)


What is Web Scraping and Where is it Used?

Very simply put, you write a program that extracts information from a web page and makes it available to you in a format you want: a CSV file, a Word document, a database, and so on.
Here are some examples of where it is used: price monitoring, news aggregation, building research datasets, and feeding data pipelines for analysis.

Basic Code Structure

The main building blocks of any web scraping project look like this:
Get the HTML (local or remote)
Create a BeautifulSoup object
Parse out the required element
Save the text inside the element for later use

The most important library here is BeautifulSoup4. It makes it easy to navigate the HTML document and find the content we need.
It takes two parameters: the HTML text, and a string denoting the parser to be used.
There are multiple options available when it comes to the parser, but lxml is perhaps the most popular one.
We can supply the HTML directly as a string, or use the Python built-in open() to read an HTML file. When it comes to fetching a web page, though, the best option is to use the requests library.
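Here is a minimal sketch of the first two options (the file name local_page.html is just a placeholder for any HTML file you have saved):

from bs4 import BeautifulSoup

# Option 1: HTML supplied directly as a string
html_snippet = "<html><body><p>Hello, scraper</p></body></html>"
soup_from_string = BeautifulSoup(html_snippet, 'lxml')
print(soup_from_string.p.text)

# Option 2: HTML read from a local file with open()
with open("local_page.html", encoding="utf-8") as f:
    soup_from_file = BeautifulSoup(f.read(), 'lxml')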

Working with requests

Before going into requests, let’s see how a browser such as Chrome, Firefox, or Safari works:

  1. When you enter a URL in the browser, the browser sends a GET request to the server
  2. The server returns a response, which contains a response code and the HTML
  3. The browser now parses this HTML (with the help of CSS) into the web page that you see on your screen

This is an oversimplification, but enough to stress that point number 3 is the most resource-consuming task. This is where the browser paints everything on your screen, fetches additional CSS, and executes JavaScript.
When working with requests, we don’t need this step at all. Everything we care about is already there in the HTML. This is what requests allows us to do: send a request, get the response, and parse the response text with BeautifulSoup4. Let’s install it from a terminal / elevated command prompt (with admin rights):

pip install requests

Now create a new .py file in your favorite editor and write these lines:

import requests
url_to_parse = "https://en.wikipedia.org/wiki/Python_(programming_language)"
response = requests.get(url_to_parse)

If we print this response object, we get <Response [200]>

If we check the type of the response object by calling type(response), we will see that it is an instance of requests.models.Response. It has many interesting properties, like status_code, encoding, and, most interesting of all, text.
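As a quick sketch, continuing from the response object created above:

print(type(response))         # <class 'requests.models.Response'>
print(response.status_code)   # 200 when the request succeeds
print(response.encoding)      # the encoding requests detected, e.g. 'UTF-8'
print(response.text[:100])    # the first 100 characters of the raw HTML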
All we need to do is pass the response.text to BeautifulSoup.

import requests
from bs4 import BeautifulSoup

url_to_parse = "https://en.wikipedia.org/wiki/Python_(programming_language)"
response = requests.get(url_to_parse)
print(response)
response_text = response.text
soup = BeautifulSoup(response_text, 'lxml')
print(soup.prettify())

Notice the call to the prettify() function. It returns the same HTML, but neatly indented, which is a good way to make the HTML more readable.

BeautifulSoup

First things first – let’s install it. Remember to use an elevated command prompt:
pip install beautifulsoup4
The first way to navigate a document is by using tag names directly. For example, soup.title gives us the page’s <title> tag. Here is an example:

import requests
from bs4 import BeautifulSoup
url_to_parse = "https://en.wikipedia.org/wiki/Python_(programming_language)"
response = requests.get(url_to_parse)
response_text = response.text
soup = BeautifulSoup(response_text, 'lxml')
print(soup.title)

Output: <title>Python (programming language) - Wikipedia</title>


Finding Specific Text with BeautifulSoup4

Perhaps the most commonly used methods are find() and find_all(), along with their CSS-selector counterparts select() and select_one() (compared briefly below). Let’s open the Wikipedia page and get the table of contents.
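As a rough sketch of the two styles, assuming the soup object built in the previous snippet: find() and find_all() filter by tag name and attributes, while select_one() and select() take CSS selectors.

# Attribute-based lookup
first_link = soup.find("a")            # first <a> tag in the document
all_links = soup.find_all("a")         # list of every <a> tag

# The same lookups written as CSS selectors
first_link_css = soup.select_one("a")
all_links_css = soup.select("a")
print(len(all_links), len(all_links_css))  # the two counts should match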
The first step is to locate the table of contents. Right-click on the div that contains the TOC and examine its markup. We can see that it is <div id="toc" class="toc">.
If we simply run soup.find("div"), it will return the first div it finds, similar to writing soup.div. This is not really useful in this case. Let’s filter it further by specifying more attributes. We are actually lucky in this case as it has an id attribute.

soup.find("div", id="toc") would solve the problem. This will return everything inside this div tag, the complete raw HTML.
Note the second parameter here: id="toc". The way BeautifulSoup works is that it will look for the attribute name and value supplied as the second parameter. So you could even write soup.find("script", type="application/ld+json") and it will find a script tag with this type attribute.
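Here is a small sketch of that idea, again assuming the soup object from earlier (the tag may not be present on every page, so we guard against None):

# find() matches whatever attribute name/value pairs we pass as keyword arguments
json_ld = soup.find("script", type="application/ld+json")
if json_ld is not None:
    print(json_ld.text[:200])    # first 200 characters of the script's contents
else:
    print("No script tag with that type attribute on this page")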

Be careful with the CSS class though. class is a reserved keyword in Python. There are two workarounds: the first is to use class_ instead of class; the second is to use a dictionary as the second argument.
It means that soup.find("div", class_="toc") and soup.find("div", {"class": "toc"}) are the same. Using a dictionary gives you the flexibility to filter on more than one attribute value. For example, soup.find("div", {"class": "toc", "id": "toc"}) if you need to specify both class and id.
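As a quick sketch of this equivalence, assuming the soup object from earlier and that the page markup contains the <div id="toc" class="toc"> described above:

# Both calls return the first <div> whose class is "toc"
toc_keyword = soup.find("div", class_="toc")
toc_dict = soup.find("div", {"class": "toc"})
print(toc_keyword is toc_dict)   # True: both point at the same tag in the parse tree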

What we are really interested in is the text, not the HTML markup. So let’s filter further:
If we inspect the text of the TOC entries on this Wikipedia page, we will see HTML like <span class="toctext">History</span>

Let’s use this in our find method and run soup.find("span", class_="toctext"). Remember that the find() method returns the first match, so this will give us <span class="toctext">History</span>. That’s better. Time to use find_all().

Remember that find_all() returns a list. So we would need a for loop to iterate over the result. Let’s look at the code so far and the output together:

import requests
from bs4 import BeautifulSoup
url_to_parse = "https://en.wikipedia.org/wiki/Python_(programming_language)"
response = requests.get(url_to_parse)
response_text = response.text
soup = BeautifulSoup(response_text, 'lxml')
toc = soup.find_all("span", class_="toctext")
for item in toc:
    print(item)

Output:
<span class="toctext">History</span>
<span class="toctext">Features and philosophy</span>
<span class="toctext">Syntax and semantics</span>
<span class="toctext">Indentation</span>
. . .

We are now ready to solve one final problem. If you look at the Wikipedia page, you will notice that some of these entries are actually subheadings. What if we just want the main headings?
We need to take one step back and look at the markup of the list items. You will notice that the main headings have the class toclevel-1 while the subheadings have toclevel-2. Let’s update our code to get the <li class="toclevel-1"> items. We will then filter this further to get the span text. Here is what the updated code looks like:

all_main_headings = []
main_headings = soup.find_all("li", class_="toclevel-1")
for h1 in main_headings:
    heading = h1.find("span", class_="toctext").text
    all_main_headings.append(heading)
    print(heading)

And now we have the table of contents text for the level-1 headings.
We can use the python-docx library to save this in a Word file, or the csv module to save it in a CSV file.
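For example, here is a minimal sketch of the CSV option, assuming the all_main_headings list built above (the file name headings.csv is just a placeholder):

import csv

# Write each heading as its own row in a CSV file
with open("headings.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["heading"])        # header row
    for heading in all_main_headings:
        writer.writerow([heading])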