What is Beautiful Soup
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favourite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
Crawling with Beautiful Soup
Step 1. Create a new virtual environment in a new project folder.
Step 2. Import BeautifulSoup and requests
import requests
from bs4 import BeautifulSoup
Step 3. Test the Request
A status code of 200 means the request succeeded. You can also print the HTTP response headers to confirm that the website is actually responding to you.
import requests
from bs4 import BeautifulSoup

result = requests.get("https://www.google.com.au/")
print(result.status_code)
print(result.headers)
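The status check above can be wrapped in a small helper so that a failed request does not crash the script. This is only a sketch: the function name `fetch` and the 10-second timeout are assumptions made for illustration, not part of the original tutorial.

```python
import requests


def fetch(url, timeout=10):
    """Return the Response for url, or None if the request fails.

    `fetch` and the default timeout are hypothetical choices for this sketch.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # raise_for_status() raises HTTPError for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print(f"Request failed: {error}")
        return None
    return response


# Example usage (needs network access):
# result = fetch("https://www.google.com.au/")
# if result is not None:
#     print(result.status_code)
```

Returning None keeps the calling code simple: check the return value once, then carry on to parsing only when a response actually arrived.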
Step 4. Fetching Page Source Code
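In this step, we store the page source code in a variable called ‘src’. Then, we create a BeautifulSoup object that parses it. The ‘lxml’ parser used here is a separate package and must be installed alongside Beautiful Soup.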
import requests
from bs4 import BeautifulSoup

result = requests.get("https://www.google.com/")
src = result.content
soup = BeautifulSoup(src, 'lxml')
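To see what the BeautifulSoup object gives you without making a network request, you can parse a small HTML string. The snippet below is a minimal sketch: the HTML content is made up for illustration, and it uses Python's built-in 'html.parser' instead of 'lxml' so that no extra package is required.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration
html = "<html><head><title>Example page</title></head><body><p>Hello</p></body></html>"

# 'html.parser' ships with Python; 'lxml' would work the same way here
soup = BeautifulSoup(html, "html.parser")

# Tags can be reached as attributes of the soup object
title_text = soup.title.string
print(title_text)

# find() returns the first matching tag; get_text() returns its text content
first_paragraph = soup.find("p").get_text()
print(first_paragraph)
```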
Step 5. Use BeautifulSoup to Extract Anchor Tags
BeautifulSoup provides a range of methods for extracting content from the page source by HTML tag, attribute, class name, ID, and so on. Here, we use the find_all() method to find all anchor tags and store them in a list called ‘links’.
Then, we loop through the list and print the ‘href’ attribute of each anchor tag.
import requests
from bs4 import BeautifulSoup

result = requests.get("https://www.google.com/")
src = result.content
soup = BeautifulSoup(src, 'lxml')

links = soup.find_all("a")
for link in links:
    # .get() returns None instead of raising KeyError when an anchor has no href
    print(link.get('href'))
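Real pages often contain anchors without an ‘href’ attribute, as well as relative links such as "/about". The sketch below shows one possible way to handle both. The HTML string and the base URL "https://example.com/" are made up for illustration, and the built-in 'html.parser' is used so the snippet runs without 'lxml'.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and HTML, used only for this illustration
base_url = "https://example.com/"
html = """
<a href="/about">About</a>
<a name="top">No href here</a>
<a href="https://example.org/page">External</a>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only anchors that actually have an href attribute;
# urljoin() turns relative links into absolute ones and leaves absolute links as-is
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

for link in links:
    print(link)
```

Filtering with href=True at the find_all() stage means the loop never has to deal with missing attributes.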