How to scrape dynamic web sources
It is common for web sources today to load more articles dynamically, for example after a user scrolls or presses a button. In such cases it is often impossible to collect a significant number of articles by simply iterating over seed URLs. An alternative solution is to employ a framework capable of emulating user activity, such as the selenium library.
Hint
Follow the instructions to install the Chrome driver.
Let’s discuss how to imitate the two most common user activities: scrolling and pressing a button.
What if my web source expects a user to scroll to provide more URLs?
First, instantiate the selenium.webdriver.Chrome class. It emulates native browsing. Save the Chrome instance to the driver attribute of your Crawler.
Hint
To disable the browser window pop-up, add the headless mode argument to a selenium.webdriver.chrome.options.Options instance and pass that instance to the Chrome initialization method. Make sure to enable headless mode only when the corresponding field in the crawler configuration requires it.
Next, to open a page, use the driver.get method.
Example usage:
self.driver.get("https://github.com/")
To perform a scroll, execute the corresponding JavaScript:
self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
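A single scroll is often not enough to load every article. A common pattern is to repeat the scroll until the page height stops growing; the sketch below assumes a two-second pause gives new content time to load, which may need tuning for a specific source.

```python
import time


def scroll_to_bottom(driver, pause: float = 2.0, max_scrolls: int = 10) -> None:
    """Scroll down until the page height stops growing or max_scrolls is reached."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give dynamically loaded content time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # No new content was loaded: we have reached the bottom.
            break
        last_height = new_height
```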
To extract the resulting page HTML, refer to the driver’s page_source attribute.
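For illustration, here is a sketch of turning page_source into a list of article URLs. It assumes BeautifulSoup is the HTML parser in use, and collect_article_links is a helper name chosen here, not part of selenium.

```python
from bs4 import BeautifulSoup  # assumption: bs4 is the parser in use


def collect_article_links(driver, pattern: str) -> list[str]:
    """Extract hrefs containing the given substring from the rendered HTML."""
    soup = BeautifulSoup(driver.page_source, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True) if pattern in a["href"]]
```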