Web Scraping in Python with Beautiful Soup and Requests

Learn how to scrape data and store it in a pandas dataframe
Data Science
Web Scraping
Python
Author

Jan Kirenz

Published

June 17, 2022

Modified

December 20, 2023

This tutorial is mainly based on the tutorial Build a Web Scraper with Python in 5 Minutes by Natassha Selvaraj as well as the Beautiful Soup documentation.

In this tutorial, you will learn how to:

  1. Scrape the web page “Quotes to Scrape” using Requests.

  2. Pull data out of the HTML using Beautiful Soup.

  3. Use SelectorGadget to inspect the CSS of the web page.

  4. Store the scraped data in a pandas dataframe.

Prerequisites

To start this tutorial, you need a basic understanding of HTML and CSS.

To learn more about HTML, CSS, Chrome DevTools and SelectorGadget, follow the instructions in this web scraping basics tutorial.

Setup

First of all, we use Anaconda to create a new environment called webscraping. We also install Python 3.11 and pip inside this new environment. Open your terminal (macOS) or your Anaconda Command Prompt (Windows) and enter:

conda create -n webscraping python=3.11 pip

Activate the environment:

conda activate webscraping

Let’s install some packages into our new environment:

pip install ipykernel jupyter pandas requests beautifulsoup4

If you are using Visual Studio Code, you first need to restart VS Code before you can select the new environment in your kernel picker.

Import the modules:

import pandas as pd
import requests
from bs4 import BeautifulSoup

Scrape the website with Requests

  • First, we use requests to scrape the website (using a GET request).

  • requests.get() fetches all the content from a particular website and returns a response object (we call it html):

url = 'http://quotes.toscrape.com/'

html = requests.get(url)
  • Check if the response was successful:
html
  • A status code of 200 means that the request has succeeded.
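  • For an explicit check, the response object also exposes the status code directly, and raise_for_status() raises an error for failed (4xx/5xx) requests (a minimal sketch):

print(html.status_code)  # 200 on success
html.raise_for_status()  # raises an HTTPError if the request failed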

Investigate HTML with Beautiful Soup

  • We can use the response object to access certain features such as content, text, headers, etc.

  • In our example, we only want to obtain text from the object.

  • Therefore, we use html.text which only returns the text of the response.

  • Running html.text through BeautifulSoup using the html.parser gives us a Beautiful Soup object:

soup = BeautifulSoup(html.text, 'html.parser')
  • soup represents the document as a nested data structure:
print(soup.prettify())

Next, we take a look at some ways to navigate that data structure.

Get all text

  • A common task is extracting all the text from a page (since the output is quite large, we don’t actually print the output of the following function):
# print(soup.get_text())

Investigate title

  • Print the complete HTML title:
soup.title
  • Show name of the title tag:
soup.title.name
  • Only print the text of the title:
soup.title.string
  • Show the name of the parent tag of title:
soup.title.parent.name

Investigate a text element

  • soup.span returns the first <span> tag in the document; .text extracts its text content:
soup.span.text
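  • Note that this attribute-style access is a Beautiful Soup shorthand for find(); the following comparison should print True (a quick sanity check, using the soup object from above):
print(soup.span.text == soup.find('span').text)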

Extract specific elements with find and find_all

  • Since there are many div tags in HTML, we can’t use the previous approaches to extract relevant information.

  • Instead, we can use the find and find_all methods to extract specific HTML tags from the web page.

  • find() returns the first element that matches our specification, while find_all() returns a list of all matching elements (a short sketch follows below).

  • Let’s say our goal is to obtain all quotes, authors and tags from the website “Quotes to Scrape”.

  • We want to store all information in a pandas dataframe (every row should contain a quote as well as the corresponding author and tags).

  • First, we use SelectorGadget in Google Chrome to inspect the website.

Review the web scraping basics tutorial to learn how to inspect websites.
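To illustrate the difference between the two methods, here is a minimal sketch using the soup object from above (each quote on the page sits in a div of class “quote”):

# find() returns the first matching element (or None if there is no match)
first_quote = soup.find('div', {'class': 'quote'})

# find_all() returns a list of all matching elements
all_quotes = soup.find_all('div', {'class': 'quote'})
print(len(all_quotes))  # 10 quotes per page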

Extract all quotes

Task: Extract all quotes

  • First, we use the div class “quote” to retrieve all relevant information regarding the quotes:
quotes = soup.find_all('div', {'class': 'quote'})
  • Next, we can iterate through our quotes object and extract the text of all quotes (the text of each quote is available in a <span> tag with class “text”):
for i in quotes:
    print((i.find('span', {'class':'text'})).text)
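  • As an aside, Beautiful Soup also supports CSS selectors via select(), which expresses the same query more compactly (an equivalent alternative; the rest of this tutorial sticks with find and find_all):

for span in soup.select('div.quote span.text'):
    print(span.text)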

Extract all authors

Task: Extract all authors

  • Again, we can use the div class “quote” to retrieve the data about the authors.

  • We could simply reuse our quotes object from before.

  • Instead, we use a different approach and call find_all() directly in our loop:

for i in soup.find_all("div", {"class": "quote"}):
    print(i.find("small", {"class": "author"}).text)

Extract all tags

Task: Extract all tags

  • Information about the tags is available in the class “tags”.

  • We need to extract the “content” attribute from the “meta” tag, which stores a quote’s tags as a single comma-separated string:

for i in soup.find_all("div", {"class": "tags"}):
    print(i.find("meta")['content'])

Create dataframe for all quotes, authors and tags

  • Next, we want to store all quotes with the corresponding authors and tags information in a pandas dataframe.

  • Note that the site has a total of ten pages and we want to collect the data from all of them.

  • The website’s URL address is structured as follows:

    • page 1: https://quotes.toscrape.com/page/1/
    • page 2: https://quotes.toscrape.com/page/2/
    • page 10: https://quotes.toscrape.com/page/10/
  • This means we can use the part “https://quotes.toscrape.com/page/” as root and iterate over the pages 1 to 10.
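For example, the full list of page URLs can be built from this root with a short list comprehension (a quick illustration of the pattern used in the loop below):

root = 'http://quotes.toscrape.com/page/'
urls = [root + str(page) for page in range(1, 11)]
print(urls[0])   # http://quotes.toscrape.com/page/1
print(urls[-1])  # http://quotes.toscrape.com/page/10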

We will proceed as follows:

  1. Store the root url without the page number as a variable called root.

  2. Prepare three empty arrays: quotes, authors and tags.

  3. Create a loop over pages 1 to 10 to iterate through every page on the site (in Python, range(1, 11)).

  4. Append the scraped data to our arrays.

  • Note that we use the same code as before (we simply replace each print() call with the corresponding .append() call):
# store root url without page number
root = 'http://quotes.toscrape.com/page/'

# create empty arrays
quotes = []
authors = []
tags = []

# loop over pages 1 to 10 (range(1, 11) includes page 10)
for page in range(1, 11):

    html = requests.get(root + str(page))
    soup = BeautifulSoup(html.text, 'html.parser')

    for i in soup.find_all("div", {"class": "quote"}):
        quotes.append(i.find("span", {"class": "text"}).text)

    for j in soup.find_all("div", {"class": "quote"}):
        authors.append(j.find("small", {"class": "author"}).text)

    for k in soup.find_all("div", {"class": "tags"}):
        tags.append(k.find("meta")['content'])
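  • Before building the dataframe, a quick length check confirms that the three lists line up (the site has 10 quotes on each of its 10 pages, so each list should contain 100 entries):
print(len(quotes), len(authors), len(tags))  # 100 100 100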
  • Create the pandas dataframe:
df = pd.DataFrame(
    {'Quotes':quotes,
     'Authors':authors,
     'Tags':tags
    })

df.head()
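  • Optionally, persist the scraped data to disk (a minimal sketch; the filename quotes.csv is just an example):
df.to_csv('quotes.csv', index=False)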
  • Congratulations! You have successfully completed this tutorial.