[Python] Web Crawling with Beautifulsoup — Basic

Dec 10, 2020

What is Beautifulsoup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favourite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Crawling with Beautifulsoup

Step 1. Create a new virtual environment in a new project folder.

Step 2. Import BeautifulSoup and requests

import requests
from bs4 import BeautifulSoup

Step 3. Testing Request

A status code of 200 means the request succeeded. You can also print the HTTP response headers to confirm that you can actually reach the website.

import requests
from bs4 import BeautifulSoup

# Send a GET request and inspect the response
result = requests.get("https://www.google.com.au/")
print(result.status_code)
print(result.headers)
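Printing the status code works for a quick check, but a script usually wants to stop as soon as a request fails. The following is a minimal sketch of that idea, not part of the original steps: the helper name fetch, the 10-second timeout, and the optional session parameter are assumptions made for illustration. It relies on raise_for_status(), which raises requests.HTTPError for 4xx and 5xx responses.

```python
import requests


def fetch(url, timeout=10, session=None):
    """Return the response for url, raising requests.HTTPError on 4xx/5xx.

    `session` is a hypothetical hook: pass a requests.Session (or any object
    with a compatible get method) to reuse connections or to test offline.
    """
    http = session or requests
    # A timeout keeps the script from hanging on an unresponsive server
    response = http.get(url, timeout=timeout)
    # Turn error status codes into exceptions instead of checking them by hand
    response.raise_for_status()
    return response
```

With this helper, fetch("https://www.google.com.au/").status_code can be printed as in the step above, and a failed request surfaces as an exception rather than a silently ignored status code.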

Step 4. Fetching Page Source Code

In this step, we store the page source code in a variable ‘src’. Then we create a BeautifulSoup object to parse it.

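import requests
from bs4 import BeautifulSoup

result = requests.get("https://www.google.com/")
# Raw bytes of the response body (the page source)
src = result.content
# The 'lxml' parser needs the lxml package to be installed
soup = BeautifulSoup(src, 'lxml')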
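The 'lxml' parser used above is a separate package, so the BeautifulSoup constructor fails if it is not installed. Here is a small sketch of one way to handle that, assuming it is acceptable to fall back to Python's built-in 'html.parser'. The helper name make_soup and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed
        return BeautifulSoup(markup, "html.parser")


# Illustrative markup; in the steps above this would be result.content
sample = "<html><head><title>Hello</title></head><body><p>Hi</p></body></html>"
soup = make_soup(sample)
print(soup.title.string)
```

The two parsers can build slightly different trees for malformed HTML, so it may be worth sticking to one parser once a crawler depends on a particular page structure.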

Step 5. Use BeautifulSoup to Crawl Anchor Tag

BeautifulSoup provides a range of useful functions for extracting content from the page source based on HTML tags, attributes, class names, IDs, and so on. Here, we use the find_all() function to find all anchor tags and store them in a list variable called ‘links’.

Then, we loop through the list and get the ‘href’ link of each anchor tag.

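import requests
from bs4 import BeautifulSoup

result = requests.get("https://www.google.com/")
src = result.content
soup = BeautifulSoup(src, 'lxml')

# Collect every anchor tag on the page
links = soup.find_all("a")
for link in links:
    # get() returns None instead of raising KeyError when an anchor has no href
    print(link.get('href'))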
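The searches by attribute, class name, and ID mentioned in this step can be tried without any network access. The sketch below runs on a made-up HTML snippet: SAMPLE_HTML, BASE_URL, and the example.com addresses are assumptions for illustration only. It also shows one possible way to turn relative href values into absolute URLs with urljoin from the standard library.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page address and markup, used only for this example
BASE_URL = "https://example.com/docs/"
SAMPLE_HTML = """
<html><body>
  <div id="nav">
    <a href="/about">About</a>
    <a class="external" href="https://www.python.org/">Python</a>
    <a name="top">No href here</a>
    <a href="guide.html">Guide</a>
  </div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# href=True keeps only the anchors that actually have an href attribute
links = [urljoin(BASE_URL, a["href"]) for a in soup.find_all("a", href=True)]

# class_ (with a trailing underscore) filters by CSS class name
external = [a.get_text(strip=True) for a in soup.find_all("a", class_="external")]

# find() returns the first match; here the element whose id is "nav"
nav = soup.find(id="nav")

for url in links:
    print(url)
```

Filtering with href=True skips anchors that have no link at all, so there is no need to check each href value for None afterwards.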

Written by Weikun Ye