Selenium (http://www.seleniumhq.org/) is a powerful web scraping tool originally developed for website testing. These days it is used whenever a site needs to be rendered exactly as it appears in the browser. Selenium works by automating a browser to load websites, retrieve the required data, take screenshots, or perform certain actions on the site.
Selenium does not include its own web browser; it requires integration with a third-party browser to run. If you run Selenium with Firefox, for example, you will literally see a Firefox instance open on your screen, navigate to your website, and perform whatever actions you specified in your code. This might be fun to watch, but I prefer my scripts to run silently in the background, so I use Chrome’s headless mode.
The Selenium library can be installed from its website (https://pypi.python.org/pypi/selenium) or from the command line with a third-party installer such as pip.
The Chrome WebDriver can be downloaded from the ChromeDriver website (http://chromedriver.chromium.org/downloads). ChromeDriver is not a Python library but a standalone application used to drive Chrome, so it must be downloaded and installed separately; it cannot be installed with pip.
The Selenium library is an API that is called on the WebDriver object. Note that the Python WebDriver object represents, and interfaces with, the downloaded WebDriver application. Although the same terminology describes both (the Python object and the application itself), it is important to keep the two concepts distinct.
A WebDriver object is a bit like a browser in that it can load websites, but it can also be used like a BeautifulSoup object to find page elements, interact with them (send text, click, etc.), and perform the other actions that drive a web scraper.
The following code retrieves the text behind the Ajax “wall” on the test page. It uses the Chrome library to create a new Selenium WebDriver, instructs the WebDriver to load the page, pauses execution for three seconds (hoping the page has finished loading), and then pulls the content out of the page.
If everything is configured correctly, the script will run for a few seconds and then return the following text as a result.
Here is some important text you want to retrieve! A button to click!
Note that locators are not selectors. A locator is an abstract query language, expressed with the By object, that is used for many things, including creating selectors.
In the code above, the locator is used to find the element with the id loadedButton.
Locators can also be used to create selectors using the WebDriver function find_element.
This is, of course, functionally equivalent to the line in the code example.
The following locator selection strategies can be used with the By object.
| Strategy | Description |
| --- | --- |
| ID | Finds elements by their HTML id attribute, as used in the example. |
| CLASS_NAME | Finds elements by their HTML class attribute. Why is this function CLASS_NAME and not simply CLASS? Because the form object.CLASS would cause problems in Selenium’s Java library, where .class is a reserved method; CLASS_NAME keeps Selenium’s syntax consistent across languages. |
| CSS_SELECTOR | Finds elements by class, id, or tag name, using the #idName, .className, tagName notation. |
| LINK_TEXT | Finds HTML `<a>` tags by the text they contain. For example, a link labeled “Next” can be selected with (By.LINK_TEXT, “Next”). |
| PARTIAL_LINK_TEXT | Like LINK_TEXT, but matches on a substring of the link text. |
| NAME | Finds HTML tags by their name attribute. This is handy for HTML forms. |
| TAG_NAME | Finds HTML tags by their tag name. |
| XPATH | Uses an XPath expression to select matching elements. |
XPath (short for XML Path) is a query language for navigating and selecting parts of an XML document. Established by the W3C in 1999, it is used for working with XML documents in languages such as Python, Java, and C#.
BeautifulSoup does not support XPath, but many other libraries, such as Scrapy and Selenium, do. XPath is designed to handle XML documents, which are more general than HTML documents, but it can often be used much like CSS selectors (such as mytag#idname).
There are four main concepts in XPath syntax.
- Root and non-root nodes
  - /div selects the div node only if it is at the root of the document
  - //div selects all div nodes no matter where they are in the document
- Attribute selection
  - //@href selects any nodes with the attribute href
  - //a[@href='http://google.com'] selects all links in the document that point to Google
- Node selection by position
  - //a[3] selects the third link in the document
  - //table[last()] selects the last table in the document
  - //a[position() < 3] selects the first two links in the document
- The asterisk (*) matches any set of characters or nodes, and can be used in a variety of situations
  - //table/tr/* selects all children of tr tags in all tables (this is good for selecting cells using both th and td tags)
  - //div[@*] selects all div tags that have any attribute
Of course, the XPath syntax has many advanced features. Over the years, XPath has evolved into a relatively complex query language involving Boolean logic, functions (like position()), and various other operators not covered here.
See Microsoft’s XPath syntax page (https://msdn.microsoft.com/en-us/library/ms256471) for more information.
4. Handling redirects
Client-side redirects can be difficult to distinguish from server-side redirects when viewing a page in a web browser: the redirect may happen so fast that you notice no delay at load time and assume that a client-side redirect was actually a server-side one.
The redirect can be detected in a clever way: “watch” an element in the DOM when the page first loads, then repeatedly call that element until Selenium throws a StaleElementReferenceException; that is, until the element is no longer attached to the page’s DOM because the site has redirected.
This script checks the page every 0.5 seconds and times out after 10 seconds; you can adjust the check interval and the timeout as needed.
Another way is to write a loop that checks the page’s current URL until either the URL changes or it matches a specific URL you are looking for.
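A minimal sketch of that URL-polling loop (the interval and timeout values are arbitrary choices):

```python
import time


def wait_for_url_change(driver, old_url, timeout=10, interval=0.5):
    """Poll driver.current_url until it differs from old_url, or time out."""
    waited = 0
    while waited < timeout:
        if driver.current_url != old_url:
            return driver.current_url  # the redirect has happened
        time.sleep(interval)
        waited += interval
    return None  # no redirect observed within the timeout
```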
Waiting for elements to appear or disappear is a common task in Selenium. You can use the same WebDriverWait function that was used in the previous button loading example. The following code does the same with a timeout of 15 seconds and an XPath selector for the page body content.
With Selenium, you can read and parse the page’s HTML and CSS just as you would with any other website code; you can also still send HTTP requests and handle responses using the techniques from previous chapters, so much of the processing can be done without Selenium at all.