Python – Scraping a Javascript page with Selenium
JavaScript is the most popular and best-supported client-side scripting language on the web today. It is used to collect user-tracking information, submit forms without page reloads, embed multimedia, and even power entire online games. Even seemingly simple pages often contain multiple pieces of JavaScript.
1. Execute JavaScript in Python with Selenium
Selenium (http://www.seleniumhq.org/) is a powerful web scraping tool originally developed for website testing. These days, it is also used whenever you need an accurate picture of a website as it appears in a browser. Selenium works by automating a browser: loading the website, retrieving the required data, taking screenshots, or performing certain actions on the website.
Selenium does not have its own web browser; it requires integration with a third-party browser in order to run. For example, when you run Selenium with Firefox, you literally see a Firefox instance open on your screen, navigate to the website, and perform the actions you specified in your code. This might be fun to watch, but I prefer my scripts to run quietly in the background, so I use Chrome's headless mode.
A headless browser loads a website into memory and runs the JavaScript on the page, but does not display the website to the user. Combining Selenium with headless Chrome gives you a very powerful web scraper that handles cookies, JavaScript, headers, and everything else you need as easily as a normal on-screen browser would.
The Selenium library can be downloaded from its PyPI page (https://pypi.python.org/pypi/selenium) or installed from the command line with a third-party installer such as pip.
Chrome WebDriver can be downloaded from the ChromeDriver website (http://chromedriver.chromium.org/downloads). ChromeDriver is not a Python library but a standalone application that is used to run Chrome, so it must be downloaded and installed separately; it cannot be installed with pip.
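Because ChromeDriver is a separate executable, it can help to confirm that it is actually present before starting a scraper. The following is a minimal sketch using only the standard library. The function name locate_chromedriver is hypothetical, and the default path drivers/chromedriver is simply the location used by the code examples in this article.

```python
import os
import shutil


def locate_chromedriver(local_path="drivers/chromedriver"):
    """Return a path to a ChromeDriver executable, or None if none is found.

    The local path is checked first; if it is missing or not executable,
    the directories on the system PATH are searched instead.
    """
    if os.path.isfile(local_path) and os.access(local_path, os.X_OK):
        return local_path
    # shutil.which returns None when no matching executable is on PATH.
    return shutil.which("chromedriver")
```

If the function returns None, the driver still needs to be downloaded or made executable before Selenium can launch Chrome.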
The Selenium library is an API built around the WebDriver object. Note that this is a Python object that represents, or acts as an interface to, the WebDriver application you downloaded. Although the same term is used for both (the Python object and the application itself), it is important to keep the two concepts distinct.
A WebDriver object is a bit like a browser in that it can load websites, but it can also be used like a BeautifulSoup object to find page elements, interact with elements on the page (send text, click, and so on), and perform other actions to drive web scrapers.
The following code retrieves the text behind the Ajax “wall” of the test page.
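from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Run Chrome without opening a visible window.
chrome_options = Options()
chrome_options.add_argument("--headless")
# executable_path is the Selenium 3 way of pointing at ChromeDriver;
# it is deprecated in Selenium 4 and removed in later 4.x releases.
driver = webdriver.Chrome(
    executable_path='drivers/chromedriver', options=chrome_options)
driver.get('http://pythonscraping.com/pages/javascript/ajaxDemo.html')
try:
    # Wait up to 10 seconds for the button that the Ajax call adds to the page.
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'loadedButton')))
finally:
    # find_element_by_id was removed in Selenium 4.3; on newer versions,
    # use driver.find_element(By.ID, 'content') instead.
    print(driver.find_element_by_id('content').text)
    driver.close()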
This creates a new Selenium WebDriver backed by headless Chrome, tells the WebDriver to load the page, and then waits up to 10 seconds for an element with the id loadedButton to appear, which signals that the Ajax content has loaded. Whether or not the wait succeeds, the finally block prints the text of the content element and closes the browser.
If everything is configured correctly, the script takes a few seconds to run and then prints the following text.
Here is some important text you want to retrieve!
A button to click!
2. Javascript Scraping (locator)
Note that a locator is not the same thing as a selector. A locator is an abstract query, built with the By object, that can be used in a variety of ways, including to make selectors.
In the code above, the locator is used to find the element with the id loadedButton.
Locators can also be used to create selectors using the WebDriver function find_element.
print(driver.find_element(By.ID, 'content').text)
This is, of course, functionally equivalent to the line in the code example.
print(driver.find_element_by_id('content').text)
The following locator selection strategies can be used with the By object.
ID | Find an element by its HTML id attribute, as used in the example. |
CLASS_NAME | Find elements by their HTML class attribute. Why is this CLASS_NAME and not simply CLASS? Because the form object.CLASS would cause problems in Selenium's Java library, where .class is reserved. CLASS_NAME is used to keep the Selenium syntax consistent across languages. |
CSS_SELECTOR | Find elements by class, id, or tag name using the #idName, .className, tagName notation. |
LINK_TEXT | Find HTML <a> tags by the text they contain. For example, a link labeled “Next” can be found using (By.LINK_TEXT, “Next”). |
PARTIAL_LINK_TEXT | Like LINK_TEXT, but matches on a partial string. |
NAME | Find HTML tags by their name attribute. This is handy for HTML forms. |
TAG_NAME | Find HTML tags by their tag name. |
XPATH | Select matching elements using an XPath expression. |
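Because a locator is just a (strategy, value) pair, it can be kept as plain data and unpacked into find_element when needed. The sketch below shows one possible way to organize this. The names read_fields and PAGE_LOCATORS are hypothetical and not part of Selenium; the only Selenium call assumed is find_element(by, value), which returns an element with a text attribute.

```python
def read_fields(driver, locators):
    """Return {name: text} for each named (strategy, value) locator.

    `driver` is expected to behave like a Selenium WebDriver: its
    find_element(by, value) method returns an element with a .text attribute.
    """
    return {
        name: driver.find_element(by, value).text
        for name, (by, value) in locators.items()
    }


# Example with a real driver (requires Selenium and a running browser):
#   from selenium.webdriver.common.by import By
#   PAGE_LOCATORS = {
#       "content": (By.ID, "content"),
#       "button": (By.ID, "loadedButton"),
#   }
#   print(read_fields(driver, PAGE_LOCATORS))
```

Keeping the locators in one dictionary means that a change to the page's markup only requires updating the data, not the scraping logic.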
3. Javascript Scraping (XPath syntax)
XPath (short for XML Path) is a query language used to navigate and select within XML documents. It was established by W3C in 1999 and is used when working with XML documents in languages such as Python, Java, and C#.
BeautifulSoup doesn't support XPath, but many other libraries, such as Scrapy and Selenium, do. XPath can often be used in the same way as CSS selectors (such as mytag#idname), although it is designed to work with XML documents in general rather than HTML documents in particular.
There are four main concepts in XPath syntax.
- Root and non-root nodes
  - /div selects the div node only when it is at the root of the document
  - //div selects all div nodes no matter where they are in the document
- Attribute selection
  - //@href selects any node with the attribute href
  - //a[@href='http://google.com'] selects all links in the document that point to Google
- Node selection by position
  - (//a)[3] selects the third link in the document; without the parentheses, //a[3] selects every link that is the third a child of its parent
  - (//table)[last()] selects the last table in the document
  - (//a)[position() < 3] selects the first two links in the document
- The asterisk (*) matches any set of characters or nodes and can be used in a variety of situations
  - //table/tr/* selects all children of tr tags in all tables (this is good for selecting cells using both th and td tags)
  - //div[@*] selects all div tags with any attribute
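The selection concepts above can be tried out without a browser. The following sketch uses Python's standard xml.etree.ElementTree module, which implements only a subset of XPath: paths must be relative (starting from an element, for example with .//), and expressions such as //@href or position() < 3 are not supported. The document is made up purely for illustration.

```python
import xml.etree.ElementTree as ET

# A small made-up document, used only for illustration.
DOC = """
<html>
  <body>
    <div id="nav">
      <a href="http://google.com">Google</a>
      <a href="/about">About</a>
    </div>
    <div id="main">
      <table>
        <tr><th>Name</th><td>Ada</td></tr>
      </table>
      <a>No href here</a>
    </div>
  </body>
</html>
"""

root = ET.fromstring(DOC.strip())

# "div" only looks at direct children of the starting node (<html>),
# while ".//div" finds div nodes at any depth.
direct_divs = root.findall("div")
all_divs = root.findall(".//div")

# Attribute selection: links that have an href, and links with a given href.
links_with_href = root.findall(".//a[@href]")
google_links = root.findall(".//a[@href='http://google.com']")

# Wildcard: every child of every table row, th and td alike.
cells = root.findall(".//table/tr/*")

# Position: the last <a> child of each div (counted per parent).
last_links = root.findall(".//div/a[last()]")

print(len(direct_divs), [d.get("id") for d in all_divs])
print([a.text for a in links_with_href], [a.text for a in google_links])
print([c.tag for c in cells])
print([a.text for a in last_links])
```

Libraries with full XPath support, including Selenium's By.XPATH strategy, accept the absolute forms listed above.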
Of course, the XPath syntax has many advanced features. Over the years, XPath has evolved into a relatively complex query language involving Boolean logic, functions (like position()), and various other operators not covered here.
See Microsoft’s XPath syntax page (https://msdn.microsoft.com/en-us/enus/library/ms256471) for more information.
4. Handling redirects
A client-side redirect is a page redirect performed by JavaScript in the browser, as opposed to a redirect performed on the server before the page content is sent.
It can be difficult to spot the difference when viewing the page in a web browser. Because redirects are fast, the delay may go unnoticed at load time, and you may think that client-side redirects are actually server-side redirects.
Selenium can handle this JavaScript redirect just like it handles any other JavaScript execution. But the main problem with these redirects is when to stop page execution, i.e. how to tell that the page has redirected. The demo page at http://pythonscraping.com/pages/javascript/redirectDemo1.html shows an example of this kind of redirect after a 2 second pause.
You can detect this redirect in a clever way: “watch” an element in the DOM when the page first loads, then repeatedly call that element until Selenium throws a StaleElementReferenceException. The exception means the element is no longer attached to the page's DOM, so the site has redirected.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import StaleElementReferenceException
import time

def waitForLoad(driver):
    # Keep a reference to the <html> element of the page as first loaded.
    elem = driver.find_element_by_tag_name("html")
    count = 0
    while True:
        count += 1
        if count > 20:
            print('Timing out after 10 seconds and returning')
            return
        time.sleep(.5)
        try:
            # Touch the original element. Once the page has been replaced,
            # any call on it raises StaleElementReferenceException.
            elem.tag_name
        except StaleElementReferenceException:
            return

# Run Chrome without opening a visible window.
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(
    executable_path='drivers/chromedriver', options=chrome_options)
driver.get('http://pythonscraping.com/pages/javascript/redirectDemo1.html')
waitForLoad(driver)
print(driver.page_source)
driver.close()
This script checks the page every 0.5 seconds and times out after 10 seconds. You can adjust the polling interval and the timeout as needed.
Alternatively, you can write a similar loop that checks the current URL of the page until the URL changes or matches a specific URL you are looking for.
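The URL-based approach can be sketched as a small polling loop. This is only an illustration under stated assumptions: the function name wait_for_url_change and its parameters are hypothetical, and the only thing it needs from the driver is the current_url attribute that Selenium WebDriver objects provide.

```python
import time


def wait_for_url_change(driver, old_url, timeout=10.0, poll_interval=0.5):
    """Poll driver.current_url until it differs from old_url.

    Returns the new URL, or None if the URL has not changed within
    `timeout` seconds. `driver` can be any object with a `current_url`
    attribute, such as a Selenium WebDriver.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = driver.current_url
        if current != old_url:
            return current
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)
```

With the redirect demo page, this could be called right after driver.get(...), passing the demo page's URL as old_url. To wait for a specific destination instead, the comparison could be changed to check for equality with the expected URL.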
Waiting for elements to appear or disappear is a common task in Selenium. You can use the same WebDriverWait function that was used in the previous button loading example. The following code does the same with a timeout of 15 seconds and an XPath selector for the page body content.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# Run Chrome without opening a visible window.
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(
    executable_path='drivers/chromedriver', options=chrome_options)
driver.get('http://pythonscraping.com/pages/javascript/redirectDemo1.html')
try:
    # Wait up to 15 seconds for a body whose text contains the target string.
    # contains() takes two arguments: the text to search and the substring.
    bodyElement = WebDriverWait(driver, 15).until(EC.presence_of_element_located(
        (By.XPATH, '//body[contains(text(), "This is the page you are looking for!")]')))
    print(bodyElement.text)
except TimeoutException:
    print('Did not find the element')
finally:
    driver.close()
Conclusion
This article covered scraping JavaScript pages with Python and Selenium. Just because a site uses JavaScript doesn't mean traditional web scraping tools are useless.
The purpose of JavaScript is ultimately to generate HTML and CSS code that the browser can render, or to communicate dynamically with a server through HTTP requests and responses.
Once you use Selenium, the page's HTML and CSS can be read and parsed just like any other website code, and HTTP requests and responses can be sent and handled by your own code with ordinary scraping techniques, even without Selenium.
In addition, JavaScript can even be an advantage for web scrapers. Because it is sometimes used as a “browser-side content management system,” it may expose useful APIs that let you retrieve data more directly.