[Python] Web Crawling with Beautifulsoup — Basic

Weikun Ye
Dec 10, 2020 · 2 min read

What is Beautiful Soup?

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favourite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
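
For a quick taste, here is a minimal, self-contained sketch (the HTML string is made up for illustration) that parses a snippet with Python’s built-in html.parser and navigates the tree:

from bs4 import BeautifulSoup

html = "<html><head><title>Demo</title></head><body><a href='/home'>Home</a></body></html>"
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.text)         # Demo
print(soup.find('a')['href'])  # /home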

Crawling with Beautiful Soup

Step 1. Create a new virtual environment in a new project folder.
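
A minimal setup on macOS or Linux might look like this (on Windows, activate with venv\Scripts\activate instead); installing lxml now saves a step later, since we pass it to BeautifulSoup as the parser:

python -m venv venv
source venv/bin/activate
pip install requests beautifulsoup4 lxml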

Step 2. Import BeautifulSoup and requests

import requests
from bs4 import BeautifulSoup

Step 3. Test the Request

You will get 200 if the request was sent successfully. You can also print the HTTP response headers to confirm that you really have access to the website.

import requests
from bs4 import BeautifulSoup
result = requests.get("https://www.google.com.au/")print(result.status_code)
print(result.headers)
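
If you want the script to fail loudly on a bad response instead of just printing the code, a common pattern (a sketch, optional for this tutorial) is:

import requests

result = requests.get("https://www.google.com.au/", timeout=10)
result.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
print(result.status_code)  # reached only on success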

Step 4. Fetching Page Source Code

In this step, we use a variable ‘src’ to store the page source code. Then, we create a BeautifulSoup object from it for crawling. The second argument selects the parser: ‘lxml’ is fast but requires the lxml package, while Python’s built-in ‘html.parser’ works without any extra install.

import requests
from bs4 import BeautifulSoup
result = requests.get("https://www.google.com/")
src = result.content  # raw bytes of the page
soup = BeautifulSoup(src, 'lxml')  # parse with the lxml parser

Step 5. Use BeautifulSoup to Crawl Anchor Tags

BeautifulSoup provides a range of useful functions to extract content from the page source based on HTML tags, attributes, class names, IDs, and so on. Here, we use the find_all() function to find all anchor tags; they are stored in a list called ‘links’.

Then, we loop through the list and print the ‘href’ attribute of each anchor tag.

import requests
from bs4 import BeautifulSoup
result = requests.get("https://www.google.com/")
src = result.content
soup = BeautifulSoup(src, 'lxml')
links = soup.find_all("a")
for link in links:
    print(link.get('href'))  # get() returns None instead of raising if an anchor has no href
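
The same approach extends beyond anchor tags. Here is a minimal sketch against a made-up HTML snippet (the class and id names are hypothetical) showing filtering by class, by ID, and with a CSS selector:

from bs4 import BeautifulSoup

html = """
<div id="menu">
  <a class="nav" href="/home">Home</a>
  <a class="nav" href="/about">About</a>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.find(id='menu').name)                  # div
print(len(soup.find_all('a', class_='nav')))      # 2
print([a['href'] for a in soup.select('a.nav')])  # ['/home', '/about']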
