Python – Scraping a website with BeautifulSoup
What is Web Scraping?
In theory, web scraping is the practice of collecting data by any means other than a program interacting with an API (or a human using a web browser). This is usually accomplished by writing an automated program that queries a web server, requests data (usually HTML or the other files that make up a web page), parses that data, and extracts the required information.
In this case, we’ll send a GET request to the web server to get a particular page, read the HTML output of that page, and do some simple data extraction to get just what we’re looking for.
If you need to crawl pages, please refer to the following page: Python – Scraping JavaScript pages with Selenium.
1. Install the scraping library (BeautifulSoup)
BeautifulSoup is not part of the Python standard library, so it must be installed. This article uses BeautifulSoup 4 (also known as BS4). If you installed Python with Anaconda, BeautifulSoup is already included. Otherwise, with a recent Python 3 installed, you can install it with:
pip install beautifulsoup4
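To confirm that the installation succeeded, one option is to import the package and print its version. This is only a minimal sketch; the variable name installed_version is illustrative.

```python
# Quick check that BeautifulSoup 4 is importable after installation.
import bs4

installed_version = bs4.__version__  # version string of the installed package
print('BeautifulSoup version:', installed_version)
```

If the import fails with ModuleNotFoundError, the package was probably installed for a different Python interpreter than the one running the script.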
2. Run BeautifulSoup
The most commonly used object in the BeautifulSoup library is the BeautifulSoup object itself. Let’s see how it works.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)
The output should look like this:
<h1>An Interesting Title</h1>
Note that this returns only the first h1 tag found on the page. By convention a page should have only one h1 tag, but this convention is often broken on the web, so the first tag returned may not be the one you are looking for.
html.parser is a parser included with Python 3, so nothing needs to be installed to use it. Unless otherwise required, this is the parser used in this article.
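The difference between the first match and all matches can be seen without fetching anything from the network. The sketch below parses a small hand-written HTML string (an assumption made for illustration, not the content of the page above) and compares bs.h1 with find_all('h1').

```python
from bs4 import BeautifulSoup

# Hypothetical HTML with two h1 tags, written by hand for this example
sample_html = """
<html><body>
<h1>First Heading</h1>
<p>Some text</p>
<h1>Second Heading</h1>
</body></html>
"""

bs = BeautifulSoup(sample_html, 'html.parser')

# Attribute access returns only the first matching tag
first_h1 = bs.h1
print(first_h1)

# find_all returns every matching tag, in document order
all_h1 = bs.find_all('h1')
for tag in all_h1:
    print(tag.get_text())
```

When a page may contain several headings, iterating over find_all and filtering by text or attributes is usually safer than relying on the first match.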
Another popular parser is lxml (http://lxml.de/parsing.html). It can be installed with pip.
pip install lxml
You can use lxml with BeautifulSoup by changing the parser string.
bs = BeautifulSoup(html.read(), 'lxml')
Compared with html.parser, lxml is generally better at parsing malformed HTML. It is forgiving and fixes problems such as unclosed tags, improperly nested tags, and missing head and body tags.
Another popular HTML parser is html5lib. It is even more forgiving than lxml and goes further in correcting broken HTML. Like lxml, it is an external dependency that must be installed separately, and it is slower than both lxml and html.parser. Still, it can be a good choice when dealing with particularly messy HTML.
bs = BeautifulSoup(html.read(), 'html5lib')
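To see how the parsers differ, the same malformed HTML can be passed to each of them. The following is a rough sketch that assumes a hand-written broken snippet; lxml and html5lib are skipped if they are not installed, and the exact output depends on the installed versions, so none is shown here.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical malformed HTML: no html/body tags, unclosed <p> and <b>
broken_html = '<p>First paragraph<p>Second <b>bold text'

parsed = {}
for parser in ('html.parser', 'lxml', 'html5lib'):
    try:
        parsed[parser] = BeautifulSoup(broken_html, parser)
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed
        print(parser, 'is not installed, skipping')
        continue
    # Print the repaired document as each parser reconstructed it
    print(parser, '->', parsed[parser])
```

Comparing the printed trees side by side shows which tags each parser adds or closes on its own, which can help when choosing a parser for a particular site.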
3. Handling Exceptions to Increase Scraping Accuracy
One of the most frustrating things about web scraping is letting a scraper run, going to bed dreaming that all your data will be in the database tomorrow, and waking up to find that the scraper hit an unexpected data format, raised an error, and stopped running shortly after you stopped looking at the screen. In a situation like this, it’s tempting to curse the name of the developer who created the website (and its odd formatting), but the person you should really be annoyed with is yourself, for not anticipating the exception in the first place.
Look at the first line of the scraper after the import statements, and consider how to handle the exceptions it might raise.
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
Two main things can go wrong in this line:
- Page not found on server (or error when fetching)
- Server not found
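The former will return an HTTP error such as “404 Page Not Found” or “500 Internal Server Error”. In all of these cases, urlopen raises the generic exception HTTPError. The latter, a server that cannot be found at all (for example, because the site is down or the URL is mistyped), causes urlopen to raise a URLError. Because HTTPError is a subclass of URLError, the HTTPError clause must come first. Both can be handled as follows.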
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen('https://pythonscrapingthisurldoesnotexist.com')
except HTTPError as e:
    # The server responded, but with an error status such as 404 or 500
    print(e)
except URLError as e:
    # The server could not be reached at all
    print('The server could not be found!')
else:
    print('It Worked!')
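The two exception types can also be told apart programmatically. The sketch below uses a hypothetical helper, fetch_status, and file:// URLs (an assumption chosen so the example runs without network access) to trigger a URLError; the same clauses apply to http:// URLs.

```python
import tempfile
from pathlib import Path
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def fetch_status(url):
    """Return a short description of what happened when opening url."""
    try:
        response = urlopen(url)
    except HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError
        return 'HTTP error {}'.format(e.code)
    except URLError as e:
        # Raised when the resource or server cannot be reached at all
        return 'URL error: {}'.format(e.reason)
    else:
        response.close()
        return 'ok'

# HTTPError inherits from URLError, so the more specific clause goes first
print(issubclass(HTTPError, URLError))

with tempfile.TemporaryDirectory() as tmp_dir:
    page = Path(tmp_dir) / 'page.html'
    page.write_text('<h1>Local page</h1>', encoding='utf-8')
    existing_status = fetch_status(page.as_uri())
    missing_status = fetch_status((Path(tmp_dir) / 'missing.html').as_uri())

print(existing_status)
print(missing_status)
```

Returning a description instead of printing inside the except clauses lets the caller decide whether to retry, log, or skip the URL.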
Of course, even if the page is fetched successfully, its content may not be quite what you expected. It’s wise to check that a tag actually exists each time you access it in a BeautifulSoup object. BeautifulSoup returns None when you try to access a tag that does not exist. The catch is that trying to access a tag on that None object raises an AttributeError.
try:
    badContent = bs.nonExistingTag.anotherTag
except AttributeError as e:
    # bs.nonExistingTag is None, so accessing .anotherTag on it fails
    print('Tag was not found')
else:
    if badContent is None:
        print('Tag was not found')
    else:
        print(badContent)
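One way to avoid repeating this try/except pattern is to walk the chain of tags one step at a time and stop as soon as a tag is missing. The helper below, find_nested, is a hypothetical name introduced for illustration; it is a sketch built on find and applied to a hand-written HTML string.

```python
from bs4 import BeautifulSoup

def find_nested(soup, *tag_names):
    """Follow a chain of tag names, returning None if any step is missing."""
    current = soup
    for name in tag_names:
        current = current.find(name)
        if current is None:
            return None
    return current

# Hand-written HTML used only for this example
bs = BeautifulSoup('<html><body><h1>A Title</h1></body></html>', 'html.parser')

found = find_nested(bs, 'body', 'h1')
missing = find_nested(bs, 'nonExistingTag', 'anotherTag')

print(found)    # the h1 tag
print(missing)  # None, with no AttributeError raised
```

With a helper like this, the caller only needs a single None check instead of a try/except around every chained access.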
Checking and handling every error like this may seem tedious at first, but a little reorganization makes the code easier to write (and, more importantly, much easier to read). The following code is the same scraper written in a slightly different way.
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        # bs.body is None if the page has no body tag
        title = bs.body.h1
    except AttributeError as e:
        return None
    return title

title = getTitle('http://www.pythonscraping.com/pages/page1.html')
if title is None:
    print('Title could not be found')
else:
    print(title)
In this example, we create a function getTitle, which returns the title of the page, or None if something went wrong while retrieving it. Inside getTitle, we check for an HTTPError, as in the previous example, and wrap the two BeautifulSoup lines in a single try statement. Either of these lines may raise an AttributeError (for example, if the page has no body tag, bs.body is None and accessing bs.body.h1 fails). Note that if the server cannot be found, urlopen raises a URLError rather than returning None, so that exception would need to be caught as well for getTitle to return None in that case. In fact, you can put as many lines as you want inside one try statement, or call another function entirely, which can raise an AttributeError at any point.
Conclusion
This article covered scraping a website with BeautifulSoup in Python.
When writing a scraper, it’s important to think about the overall structure of the code, both to handle exceptions and to keep it readable. Think about code reuse as well: generic functions such as getSiteHTML and getTitle (complete with exception handling) make it possible to scrape the web quickly and reliably.
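As one possible shape for such reusable helpers, the sketch below splits fetching and parsing into getSiteHTML and builds getTitle on top of it. The body of getSiteHTML is an assumption (the article names the function but does not define it), and the file:// URL is used only so the example runs without network access.

```python
import tempfile
from pathlib import Path
from urllib.error import URLError
from urllib.request import urlopen

from bs4 import BeautifulSoup

def getSiteHTML(url):
    """Fetch url and return a BeautifulSoup object, or None on failure."""
    try:
        # URLError also covers HTTPError, which is its subclass
        with urlopen(url, timeout=10) as response:
            return BeautifulSoup(response.read(), 'html.parser')
    except URLError:
        return None

def getTitle(url):
    """Return the first h1 tag inside body, or None if unavailable."""
    bs = getSiteHTML(url)
    if bs is None or bs.body is None:
        return None
    return bs.body.h1

with tempfile.TemporaryDirectory() as tmp_dir:
    page = Path(tmp_dir) / 'page1.html'
    page.write_text(
        '<html><body><h1>An Interesting Title</h1></body></html>',
        encoding='utf-8',
    )
    title = getTitle(page.as_uri())
    missing_title = getTitle((Path(tmp_dir) / 'missing.html').as_uri())

print(title)
print(missing_title)
```

Keeping the fetch logic in one place means other extraction functions can reuse getSiteHTML without repeating the exception handling.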