[Python] Web Crawling with Beautifulsoup — Advanced

Weikun Ye
5 min read · Dec 11, 2020


Photo by goran_ivos on Unsplash

Introduction

In my previous article Web Crawling with Beautifulsoup — Basic, I presented the setup of Beautifulsoup and basic web crawling with it, such as extracting the href attributes from anchor tags. However, that is not enough to resolve two major issues in crawling modern websites.

  1. Anti-Scraping Issue
    Many large websites adopt anti-scraping technologies to prevent their data from being scraped, for a variety of reasons. Such websites may block requests sent directly from a server, or check whether the request carries user-agent information. In these cases, a plain requests.get('url') call is likely to be blocked (see the sketch after this list).
  2. Client-Side Rendering Issue
    Most modern websites use JavaScript libraries such as Vue, React, and Angular to handle complex front-end logic. This means that even when a request to the target page returns a successful HTTP status code, the HTML in the response may contain little or no content, because the content is rendered in the browser.
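
To illustrate the first issue, here is a minimal sketch; the URL is a hypothetical placeholder, and what a blocked script actually receives varies from site to site:

import requests

# A bare request with no browser-like headers; sites with
# anti-scraping measures often reject requests like this one.
response = requests.get('https://example.com/products')
# A browser can load the page, yet the script may see 403 or 503
print(response.status_code)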

This article will mainly focus on resolving these two issues.

Solving the Anti-Scraping Issue with Webdriver

If you can visit a web page in a browser but requests.get() returns an unsuccessful HTTP status code, the website is probably blocking requests that lack user-agent information, or detecting whether the request comes from a real browser.

To solve this issue, we will use a web driver to send our requests through a real browser.

Step 1. Download Webdriver

I will use Chrome Webdriver in this tutorial since Chrome is one of the most popular browsers in the market.

Visit the ChromeDriver downloads page (https://chromedriver.chromium.org/downloads) and download the release that matches the Chrome browser on your machine. In my case, I will download version 87 since my local Chrome browser version is 87.

Figure: current releases of Chrome Webdriver, and my local Chrome browser version.

Step 2. Add Chrome Webdriver to the System Environment Variables

Once you have downloaded the executable, put it in a folder. I put mine in a folder called chromedriver_win32 under C:\Program Files.

Then, copy the path of the folder and add it to the PATH system environment variable.

Now, we are ready to go back to our project files.
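
As a side note, if you prefer not to edit the environment variables, Selenium (version 3, current at the time of writing) also accepts the driver's location directly; a minimal sketch, reusing the folder path from Step 2:

from selenium import webdriver

# Alternative to editing PATH: point Selenium at the executable explicitly
browser = webdriver.Chrome(executable_path=r'C:\Program Files\chromedriver_win32\chromedriver.exe')
browser.quit()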

Step 3. Use Chrome Webdriver to Fetch Web Source Code

In your Python file, import BeautifulSoup and webdriver.

from bs4 import BeautifulSoup
from selenium import webdriver

Then, create a variable with user agent information and pass it to the webdriver as a Chrome option.

from bs4 import BeautifulSoup
from selenium import webdriver
# User agent info
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36'
# Chrome options
options = webdriver.ChromeOptions()
# Add the user-agent option
options.add_argument('user-agent={0}'.format(user_agent))
# Initialize the browser
browser = webdriver.Chrome(options=options)

After this, you are ready to get page content with Chrome Webdriver.

from bs4 import BeautifulSoup
from selenium import webdriver
# User agent info
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36'
# Chrome options
options = webdriver.ChromeOptions()
# Add the user-agent option
options.add_argument('user-agent={0}'.format(user_agent))
# Initialize the browser
browser = webdriver.Chrome(options=options)
# Load the target page (replace 'url' with your target URL)
browser.get('url')
# Get the HTML page source
html = browser.page_source
# Initialize BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# Quit the browser
browser.quit()

When you run this code, your system will automatically open a new Chrome window with the URL you passed. Since the visit now looks like a regular browser session, the script is able to get the content from the website.
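
A small optional tweak: if you do not want a Chrome window popping up on every run, Chrome can also be started in headless mode by adding one more option before initializing the browser:

from selenium import webdriver

options = webdriver.ChromeOptions()
# Run Chrome without opening a visible window
options.add_argument('--headless')
browser = webdriver.Chrome(options=options)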

Solving the Client-Side Rendering Issue

When you use Chrome Webdriver to visit a web page, you may get no content in the <body> element or <main> element. As mentioned above, the reason is that the target website uses some sort of JavaScript framework, so the content is rendered on the client side (in the browser) rather than on the server side.

To handle this issue, the Python Selenium library provides a couple of very useful functions.

Step 1. Check the Target Web Page's DOM Elements

Most web pages contain HTML elements, such as <h1>, that are essential to the page. For example, if you want to crawl product information from an online store, the product page must have a title; if you want to crawl news from a news website, each article must have a heading. You can inspect the page and use the class name of such an element as an indication of when the page has finished rendering.

Step 2. Use WebDriverWait to Delay Crawling Web Content

Import WebDriverWait, By, and expected_conditions.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

Then, initialize a WebDriverWait object called wait, passing in the browser instance and 10 as the maximum number of seconds to wait for the condition.

The reason for setting a maximum waiting time is that your code could otherwise wait for a very long time if it tries to detect something that is not on the web page at all. When the time runs out, wait.until() raises a TimeoutException.
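
You may want to catch that TimeoutException explicitly; a minimal sketch, using the ".product-title" class we will pass to the wait in the next step:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

browser = webdriver.Chrome()
browser.get('url')  # replace 'url' with your target URL
wait = WebDriverWait(browser, 10)
try:
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.product-title')))
except TimeoutException:
    # The element never became visible within 10 seconds
    browser.quit()
    raise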

Then, pass the expected condition, built from By.CSS_SELECTOR and the CSS class name (".product-title" in my case), to the wait.until() function.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
# User agent info
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36'
# Chrome options
options = webdriver.ChromeOptions()
# Add the user-agent option
options.add_argument('user-agent={0}'.format(user_agent))
# Initialize the browser
browser = webdriver.Chrome(options=options)
# Load the target page (replace 'url' with your target URL)
browser.get('url')
# Wait until the page has loaded and the target class is rendered
wait = WebDriverWait(browser, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.product-title')))
# Get the HTML page source
html = browser.page_source
# Initialize BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# Quit the browser
browser.quit()

This code snippet opens a URL with Chrome Webdriver and waits a maximum of 10 seconds for an HTML element with the class name "product-title" to appear. It fetches the web page content once the "product-title" element is visible on the page.
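
Once the soup object is ready, you can pull out the element you waited for. A minimal sketch, continuing directly from the snippet above:

# Extract the text of the element we waited for
title = soup.select_one('.product-title')
if title is not None:
    print(title.get_text(strip=True))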

Conclusion

With the help of the Selenium library, we can use a web driver to visit a target web page and fetch the content after client-side rendering is done. This allows us to handle more complicated web crawling tasks.
