Photo by Valery Sysoev on Unsplash

Web Scraping in Python with Beautiful Soup and Requests

Learn how to scrape data and store it in a pandas dataframe

This tutorial is mainly based on the tutorial Build a Web Scraper with Python in 5 Minutes by Natassha Selvaraj as well as the Beautiful Soup documentation.

In this tutorial, you will learn how to:

  1. Scrape the web page “Quotes to Scrape” using Requests.

  2. Pull data out of HTML using Beautiful Soup.

  3. Use Selector Gadget to inspect the CSS of the web page.

  4. Store the scraped data in a pandas dataframe.

Prerequisites

To start this tutorial, you need:

To learn more about HTML, CSS, Chrome DevTools and the Selector Gadget, follow the instructions in this web scraping basics tutorial.

Setup

import pandas as pd

import requests
from bs4 import BeautifulSoup

Scrape website with Requests

  • First, we use requests to scrape the website (using a GET request).

  • requests.get() sends a GET request to the given URL and returns a response object (we call it html):

url = 'http://quotes.toscrape.com/'

html = requests.get(url)
  • Check if the response was successful:
html
<Response [200]>
  • The status code 200 means that the request has succeeded.
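The status check can also be done in code instead of by eye. The following is a minimal sketch and not part of the original tutorial: the helper name fetch_html and the 10-second timeout are assumptions chosen for illustration. It relies on raise_for_status(), which raises an exception for 4xx and 5xx status codes.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the body of `url` as text (hypothetical helper, not from the tutorial)."""
    # timeout (in seconds) is an assumed value; without one a request can hang indefinitely
    response = requests.get(url, timeout=timeout)
    # raises requests.HTTPError if the status code is 4xx or 5xx
    response.raise_for_status()
    return response.text
```

Calling fetch_html('http://quotes.toscrape.com/') should then give the same text as html.text in this tutorial, or fail loudly if the request did not succeed.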

Investigate HTML with Beautiful Soup

  • We can use the response object to access certain features such as content, text, headers, etc.

  • In our example, we only want to obtain text from the object.

  • Therefore, we use html.text, which returns the body of the response as a string.

  • Running html.text through BeautifulSoup using the html.parser gives us a Beautiful Soup object:

soup = BeautifulSoup(html.text, 'html.parser')
  • soup represents the document as a nested data structure:
print(soup.prettify())
<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
       Login
      </a>
     </p>
    </div>
   </div>
   <div class="row">
    <div class="col-md-8">
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Albert Einstein
       </small>
       <a href="/author/Albert-Einstein">
        (about)
       </a>
      </span>
      <div class="tags">
       Tags:
       <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>

     ...


  <footer class="footer">
   <div class="container">
    <p class="text-muted">
     Quotes by:
     <a href="https://www.goodreads.com/quotes">
      GoodReads.com
     </a>
    </p>
    <p class="copyright">
     Made with
     <span class="sh-red">
      ❤
     </span>
     by
     <a href="https://scrapinghub.com">
      Scrapinghub
     </a>
    </p>
   </div>
  </footer>
 </body>
</html>

Next, we take a look at some ways to navigate that data structure.

Get all text

  • A common task is extracting all the text from a page (since the output is quite large, we don’t actually print the output of the following function):
# print(soup.get_text())
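To see what get_text() does without printing the whole page, here is a small sketch on a made-up HTML snippet (the snippet and the name demo are assumptions for illustration). The optional separator and strip arguments help to keep the extracted text readable.

```python
from bs4 import BeautifulSoup

# small inline snippet (made up for illustration) so the example runs without a network request
snippet = '<div><span class="text">A quote.</span><small class="author">An Author</small></div>'
demo = BeautifulSoup(snippet, 'html.parser')

# separator is placed between text fragments; strip=True trims whitespace around each fragment
print(demo.get_text(separator=' | ', strip=True))
```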

Investigate title

  • Print the complete HTML title:
soup.title
<title>Quotes to Scrape</title>
  • Show name of the title tag:
soup.title.name
'title'
  • Only print the text of the title:
soup.title.string
'Quotes to Scrape'
  • Show the name of the parent tag of title:
soup.title.parent.name
'head'
  • Show the first hyperlink in the document:
soup.a
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
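Attribute access such as soup.a only returns the first matching tag. The following sketch, which uses a made-up snippet rather than the real page, shows how to read an attribute of that tag and how to collect all links instead.

```python
from bs4 import BeautifulSoup

# made-up snippet with two links, so the example runs offline
snippet = '<p><a href="/">Quotes to Scrape</a> <a href="/login">Login</a></p>'
demo = BeautifulSoup(snippet, 'html.parser')

first_link = demo.a                  # same result as demo.find('a')
print(first_link['href'])            # attributes are accessed like dictionary keys
print(first_link.get('title'))       # .get() returns None for a missing attribute
print([a['href'] for a in demo.find_all('a')])  # every link, not just the first
```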

Investigate a text element

soup.span.text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

Extract specific elements with find and find_all

  • Since the page contains many div tags and attribute access such as soup.div only returns the first match, we can’t use the previous approach to extract all relevant information.

  • Instead, we use the find and find_all methods to extract specific HTML tags from the web page.

  • find returns the first element that matches our specifications, whereas find_all returns all matching elements on the page.

  • Let’s say our goal is to obtain all quotes, authors and tags from the website “Quotes to Scrape”.

  • We want to store all information in a pandas dataframe (every row should contain a quote as well as the corresponding author and tags).

  • First, we use SelectorGadget in Google Chrome to inspect the website.

Review the web scraping basics tutorial to learn how to inspect websites.
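The difference between find and find_all can be sketched on a small, made-up snippet (the HTML below only mimics the structure of the real page and is an assumption for illustration):

```python
from bs4 import BeautifulSoup

snippet = """
<div class="quote"><span class="text">First quote</span></div>
<div class="quote"><span class="text">Second quote</span></div>
<div class="footer">Not a quote</div>
"""
demo = BeautifulSoup(snippet, 'html.parser')

first = demo.find('div', {'class': 'quote'})       # a single tag (the first match)
every = demo.find_all('div', {'class': 'quote'})   # a list-like collection of all matches
same = demo.find_all('div', class_='quote')        # keyword form; class_ avoids the reserved word
missing = demo.find('div', {'class': 'sidebar'})   # None when nothing matches

print(first.span.text)
print(len(every), len(same))
print(missing)
```

Because find returns None when nothing matches, calling .text on the result of a failed search raises an AttributeError, so it can be worth checking for None first.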

Extract all quotes

Task: Extract all quotes

  • First, we use the div class “quote” to retrieve all relevant information regarding the quotes:
quotes = soup.find_all('div', {'class': 'quote'})
  • Next, we can iterate through our quotes object and extract the text of all quotes (the text of each quote is available in the <span> tag with the class “text”):
for i in quotes:
    print((i.find('span', {'class':'text'})).text)
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”
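Since SelectorGadget produces CSS selectors, it may be convenient to pass such a selector directly to Beautiful Soup’s select() method. This is an alternative to the find_all approach used above, shown here as a sketch on a made-up snippet that mimics the page structure:

```python
from bs4 import BeautifulSoup

snippet = """
<div class="quote"><span class="text">First quote</span><span>by <small class="author">A</small></span></div>
<div class="quote"><span class="text">Second quote</span><span>by <small class="author">B</small></span></div>
"""
demo = BeautifulSoup(snippet, 'html.parser')

# "div.quote span.text" matches every <span class="text"> inside a <div class="quote">
texts = [span.text for span in demo.select('div.quote span.text')]
print(texts)
```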

Extract all authors

Task: Extract all authors

  • Again, we can use the div class “quote” to retrieve the data about the authors.

  • We could simply reuse our quotes object from before.

  • Instead, we use a different approach and call the find_all() function directly in our loop:

for i in soup.find_all("div", {"class": "quote"}):
    print(i.find("small", {"class": "author"}).text)
Albert Einstein
J.K. Rowling
Albert Einstein
Jane Austen
Marilyn Monroe
Albert Einstein
André Gide
Thomas A. Edison
Eleanor Roosevelt
Steve Martin

Extract all tags

Task: Extract all tags

  • Information about the tags is available in the class “tags”.

  • We need to extract the “content” attribute of the “meta” tag, which holds all tags of a quote as one comma-separated string:

for i in soup.find_all("div", {"class": "tags"}):
    print(i.find("meta")['content'])
change,deep-thoughts,thinking,world
abilities,choices
inspirational,life,live,miracle,miracles
aliteracy,books,classic,humor
be-yourself,inspirational
adulthood,success,value
life,love
edison,failure,inspirational,paraphrased
misattributed-eleanor-roosevelt
humor,obvious,simile
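If a list of individual tags is preferred over one string per quote, the string can be split at the commas. A minimal sketch on a made-up snippet (the variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

snippet = """
<div class="tags"><meta class="keywords" content="change,deep-thoughts,thinking,world"/></div>
<div class="tags"><meta class="keywords" content="abilities,choices"/></div>
"""
demo = BeautifulSoup(snippet, 'html.parser')

# one list of tags per quote
tag_lists = [
    div.find('meta')['content'].split(',')
    for div in demo.find_all('div', {'class': 'tags'})
]
print(tag_lists)
```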

Create dataframe for all quotes, authors and tags

We will proceed as follows:

  1. Store the root url without the page number as a variable called root.

  2. Prepare three empty lists: quotes, authors and tags.

  3. Create a loop that ranges from 1 to 10 to iterate through every page on the site.

  4. Append the scraped data to our lists.

  • Note that we use the same code as before (we simply replace print with append):
# store root url without page number
root = 'http://quotes.toscrape.com/page/'

# create empty lists
quotes = []
authors = []
tags = []

# loop over pages 1 to 10 (range excludes the stop value, so we use 11)
for page in range(1, 11):

    html = requests.get(root + str(page))

    # name the parser explicitly, as we did before
    soup = BeautifulSoup(html.text, 'html.parser')

    for i in soup.find_all("div", {"class": "quote"}):
        quotes.append(i.find("span", {"class": "text"}).text)

    for j in soup.find_all("div", {"class": "quote"}):
        authors.append(j.find("small", {"class": "author"}).text)

    for k in soup.find_all("div", {"class": "tags"}):
        tags.append(k.find("meta")['content'])
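The three separate loops rely on every quote having exactly one author and one tags element, otherwise the lists would get out of step. As an alternative sketch (not the approach of the original tutorial), all three fields can be read from the same div in one pass. The helper name parse_quotes and the sample HTML are assumptions for illustration.

```python
from bs4 import BeautifulSoup


def parse_quotes(page_html):
    """Return one dict per quote found in `page_html` (hypothetical helper)."""
    soup = BeautifulSoup(page_html, 'html.parser')
    records = []
    for block in soup.find_all('div', {'class': 'quote'}):
        meta = block.find('meta')
        records.append({
            'Quotes': block.find('span', {'class': 'text'}).text,
            'Authors': block.find('small', {'class': 'author'}).text,
            # fall back to an empty string if a quote has no meta tag
            'Tags': meta['content'] if meta is not None else '',
        })
    return records


# inline sample that mimics the page structure, so the sketch runs offline
sample = """
<div class="quote">
  <span class="text">First quote</span>
  <span>by <small class="author">Author A</small></span>
  <div class="tags"><meta class="keywords" content="life,love"/></div>
</div>
"""
print(parse_quotes(sample))
```

A list of such dicts can be passed to pd.DataFrame() directly, which yields the same three columns.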
  • Create the pandas dataframe:
df = pd.DataFrame(
    {'Quotes':quotes,
     'Authors':authors,
     'Tags':tags
    })
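It is usually worth checking the result before working with it. The following sketch uses two made-up rows in place of the scraped lists (df_demo and its contents are assumptions for illustration); with the real data, df.shape should report 100 rows if all ten pages were scraped and each page holds ten quotes.

```python
import pandas as pd

# two made-up rows standing in for the scraped lists
df_demo = pd.DataFrame({
    'Quotes': ['First quote', 'Second quote'],
    'Authors': ['Author A', 'Author B'],
    'Tags': ['life,love', 'humor'],
})

print(df_demo.shape)   # (number of rows, number of columns)

# turn the comma-separated string into a list of tags per row
df_demo['Tags'] = df_demo['Tags'].str.split(',')
print(df_demo.head())
```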
  • Congratulations! You have successfully completed this tutorial.
Jan Kirenz
Professor

I’m a data scientist educator and consultant.