How To Scrape A Dynamic Website With Python: A Beginner’s Guide
There’s nothing quite like hitting your stride with Python web scraping. Your scrapers are running smoothly and efficiently, all the right data is going to the right place in the right format, and let’s be honest — you feel like a bit of a wizard.
Of course, there are always hitches. Maybe you need to scrape a poorly structured site or one that requires you to log in or solve a CAPTCHA to access the right information. Learning how to scrape a dynamic website is one of the big hurdles you’re likely to come across sooner rather than later in your scraping journey. Depending on the tools you’ve been using for web scraping so far, it might even require you to learn a whole new Python library, making the task seem like an enormous uphill battle.
But learning how to scrape a dynamic website isn’t impossible, or even that challenging, once you understand how dynamic sites differ from static web pages and how to adjust your approach. This guide will cover the basics of scraping a dynamic website, why dynamic pages complicate web scraping, which tools are best for the job, and an example of how to scrape a dynamic website with Python.
What Is a Dynamic Web Page?
Web development has followed (and, in some ways, driven) the ongoing trend of content personalization for consumers. Though simple HTML/CSS-based websites aren’t obsolete by any means, most popular websites now curate the content they display based on characteristics of the individual user. These are dynamic websites, and as you may have guessed, they’re built differently from their static predecessors. You’ll need a different approach to master how to scrape a dynamic website.
Static vs. dynamic content
Static websites are usually built with just HTML and CSS. They’re best suited to presenting information that doesn’t need to be updated often, because updating these sites means manually changing the content in the source code. The content and presentation of a static page are likely to be the same each time you visit it because the page exists in that form even before you open it — it’s pre-rendered.
By contrast, dynamic web pages are rendered uniquely for each user, whether on the server, in the browser via JavaScript, or both. When you open a dynamic page, your browser sends a request to the host server along with information about, for example:
- Your user account
- Your location, based on your IP address
- The content you browsed or clicked in your current or previous visits to the site
The server then uses this information to access appropriate, up-to-date, and personalized content from a database to populate the page.
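To make this concrete, here’s a minimal sketch using the Python Requests library of the kind of identifying information that accompanies a page request. The header and cookie values are hypothetical; a real browser fills them in automatically:

import requests

# Hypothetical values; a real browser supplies these automatically.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # browser and OS info
    "Accept-Language": "en-US,en;q=0.9",  # preferred language
}
cookies = {"session_id": "abc123"}  # identifies your account or session

# The server can combine these with your IP address to personalize the page.
response = requests.get("https://www.example.com", headers=headers, cookies=cookies)
print(response.status_code)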
Developers create dynamic web pages for this kind of consumer personalization but also to avoid frequent updates when the page’s content changes. For example, e-commerce shops use dynamic web pages to show available inventory and hide or update items that are out of stock based on databases that update automatically following sales.
Both static and dynamic web pages can include JavaScript, but dynamic websites are built with server-side languages such as PHP, ASP, or Python, along with techniques like AJAX, that enable the web pages to show users relevant content.
How dynamic web pages complicate scraping
When you’re just learning how to scrape the web with Python, you likely focus on static pages. Static websites are generally much easier to scrape because their structure doesn’t change. Save for the occasional site update, you’ll always know which information is going to exist within which elements on which pages. This also means that you shouldn’t need to scrape the same page repeatedly for content updates or to show changes in the data presented on the page over time.
Learning how to scrape a dynamic website is more complex for several reasons:
- Some larger dynamic websites technically require you to use their application programming interface, or API, to scrape their data.
- If you want to scrape all the pages from a website, navigating from page to page on a static website usually changes the URL in some way. This can make it easier to iterate through the pages in your Python code. Page hopping on dynamic websites doesn’t necessarily change the URL, so you have to use different methods to achieve the same result.
- Navigating dynamic sites often requires clicking through visual elements that may not be as clearly delineated in the source code as on a static site. Web page functions like infinite scrolling can disrupt the scraping process, too.
- Dynamic sites sometimes require users to log in to access the content worth scraping.
- Because the server renders dynamic web pages each time you pull them up, they often have longer load times than static pages. This can extend the run time of your web scraper or cause it to throw a timeout error if you don’t account for the lag. This is a common issue when learning how to scrape a dynamic website.
However, more complex definitely doesn’t mean impossible. Most of these hurdles can be overcome by adjusting your approach to learning how to scrape a dynamic website.
Scraping Dynamic Web Pages: Which Tools Are Best?
The number of tools out there for scraping web pages with Python isn’t quite overwhelming, but it takes some research to pick the right one for the job. The first step is knowing which Python libraries will help you learn how to scrape a dynamic website (and which libraries will be more hassle than help), and we’ve done that leg work for you. Here’s an overview of some of the more popular web scraping tools and how they stack up when learning how to scrape a dynamic website.
Beautiful Soup
Beautiful Soup, or BS4, is often the first tool people use when they venture into web scraping, and for a good reason: it’s one of the simpler web scraping tools. But Beautiful Soup isn’t actually a web scraping library. This is important to remember when learning how to scrape a dynamic website.
From start to finish, web scraping involves accessing data on a web page, parsing that data, and pulling the right information onto your device or server in a usable format. BS4 doesn’t handle web crawling or data export. It only parses HTML data.
With BS4, you have to use other methods to access the target web page, such as downloading the HTML manually and referencing the file on your device or using the Python Requests library to access the web page. Moreover, its output functions are pretty basic, so you need to know some standard Python export methods if you need the data in a CSV or database.
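For reference, here’s a minimal sketch of that workflow. The URL, tag, and class name are placeholders, not from any particular site:

import csv

import requests
from bs4 import BeautifulSoup

# Fetch the page with Requests, since Beautiful Soup can't do this itself.
response = requests.get("https://www.example.com")
soup = BeautifulSoup(response.text, "html.parser")

# Parse out the text of every element matching a (hypothetical) class name.
titles = [tag.get_text(strip=True) for tag in soup.find_all("h2", class_="title")]

# Export with the standard csv module, since BS4 has no export functions.
with open("titles.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title"])
    for title in titles:
        writer.writerow([title])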
Beautiful Soup isn’t a good option for mastering how to scrape a dynamic website because it can’t execute JavaScript or interact with a page; it only parses HTML that has already been fetched and rendered.
Scrapy
Scrapy, on the other hand, is a full-service web scraping library. It’s faster, more customizable, and more powerful than many other options. It can handle all three stages of web scraping, from site crawling to data export/organization into several different formats, including JSON and databases. It can definitely help you as you learn how to scrape a dynamic website.
Scrapy includes a built-in scraping shell that allows you to test and debug your scraping code. You can even adjust its middleware components, which alter the way it handles requests and responses, to implement custom proxy management.
Mastering how to scrape a dynamic website with Scrapy usually requires you to use middlewares, such as Splash or ajaxcrawl, or otherwise mimic the form data necessary to produce the desired response. Depending on your experience level, this may mean that you need to learn how to tinker around with your browser’s network tab and Postman, on top of learning to build Scrapy spiders so you can adequately mimic HTTP requests.
Scrapy is a more complicated choice if you’re just learning how to scrape a dynamic website (though there are several decent tutorials out there to learn the ropes). But its speed, diverse functionality, and customizability make it a great choice for larger, more complex scraping projects.
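To give you a feel for the structure, here’s a minimal spider sketch. The site, selectors, and field names are placeholders, and a truly dynamic site would also need a rendering middleware such as Splash or scrapy-playwright:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://www.example.com"]

    def parse(self, response):
        # These CSS selectors are hypothetical; adjust them to your target site.
        for article in response.css("div.article"):
            yield {
                "title": article.css("h2::text").get(),
                "author": article.css("span.author::text").get(),
            }

        # Follow the pagination link, if one exists.
        next_page = response.css("a.next::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

You can run a spider like this with scrapy runspider spider.py -o output.json and let Scrapy handle the export.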
Selenium
Like Beautiful Soup, Selenium is used for web scraping, but it’s not a web scraping library — it’s a browser automation tool. Selenium launches a real browser (Chrome, Firefox, and others, optionally headless) and then navigates the target site using a combination of XPath or CSS selectors and simulated user behavior (like clicks and scrolling).
Because Selenium crawls web pages most like a real user, it’s often the tool of choice when learning how to scrape a dynamic website in Python. It’s also simpler for beginners than Scrapy. However, the browser it drives (even a headless one) runs as a full application in the background of your device, using much more of your CPU than command-line-based Scrapy. On larger projects, this resource use can easily snowball out of control. If CPU use isn’t a big concern, Selenium is the best option for helping you learn how to scrape dynamic websites.
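As a rough sketch of what that looks like (the URL and class name are placeholders; with Selenium 3.x, which this guide’s later example installs, you’d also need a matching chromedriver on your PATH):

from selenium import webdriver
from selenium.webdriver.common.by import By

# Run Chrome headlessly so no visible window opens.
options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.example.com")
    # The class name here is hypothetical; inspect your target page for the real one.
    for element in driver.find_elements(By.CLASS_NAME, "headline"):
        print(element.text)
finally:
    driver.quit()  # always close the browser to free up CPU and memory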
If you’re drawn to Selenium because of its simplicity, Helium makes it even easier. It operates on top of Selenium (you can even call both in the same scraper). But it also offers further simplified methods for interacting with web pages that are closer to the instructions you’d give a human for navigating a page than the commands and Xpaths that other tools require.
Note about headless browsers: There are other ways to use a headless browser to help you learn how to scrape a dynamic website, similar to the way Selenium operates. For example, you can use one with Scrapy through scrapy-playwright or independent of Scrapy through Pyppeteer. But any headless browser is going to eat up your computer’s CPU and memory, so keep an eye on the size of the scrapes you’re using these tools for.
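If you go the Pyppeteer route, a bare-bones sketch looks something like this (the URL is a placeholder):

import asyncio

from pyppeteer import launch

async def main():
    # Launch a headless Chromium instance.
    browser = await launch(headless=True)
    page = await browser.newPage()
    await page.goto("https://www.example.com")

    # page.content() returns the HTML after JavaScript has run,
    # which is exactly what you need for dynamic pages.
    html = await page.content()
    print(len(html))

    await browser.close()

asyncio.run(main())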
Scraping Robot
Developers who manage frequent or massive scrapes, and tech newbies who don’t have time to invest in learning the ins and outs of web scraping or how to scrape a dynamic website, can rely on Scraping Robot to handle the complicated stuff. Scraping Robot integrates proxies and servers and clears common scraping hurdles with ease, bypassing CAPTCHAs and successfully scraping any data you need from JavaScript-rendered dynamic web pages.
The free version of Scraping Robot includes 5,000 free scrapes per month with all the available features. Multiple integrations and great features, such as browser screenshots and POST requests, are also in the works.
If you have more complex web scraping ideas and you’re looking for an ideal end-to-end solution, the Scraping Robot support team can help you build a custom scraping project. There’s no need to learn all the details of how to scrape a dynamic website when you use Scraping Robot, though more knowledge is always helpful.
Concepts To Master Before You Start Learning How To Scrape A Dynamic Website With Python
Learning how to scrape a dynamic website with Python has a few prerequisites, regardless of the tool you use or the characteristics of the target website. Make sure you’re familiar with these concepts before you get started:
- You should be comfortable inspecting a page’s source code. In most browsers, you can do this by right-clicking an object you’re interested in (or anywhere on the page itself) and choosing “Inspect.”
- Dynamic website scraping doesn’t rely entirely on XPath selectors for navigation, but you may still use them to gather certain types of data. Because XPaths copied from your browser aren’t usually the most efficient paths to the data, it’s a good idea to understand some XPath syntax so you can simplify your code wherever possible (see the short example after this list).
- Always follow web scraping best practices. Check the site’s robots.txt file before you start to make sure you’re not forbidden from scraping any or certain aspects of the data, and moderate your scraping speed to avoid overloading the server and landing your IP address on a block list.
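On that second point, here’s what the difference typically looks like. The paths below are illustrative, not copied from a real page:

# A browser-copied XPath is brittle: it breaks if any ancestor element changes.
copied_xpath = "/html/body/div[2]/div/div[1]/section/article/h2/a"

# A handwritten XPath keyed to a stable attribute is shorter and sturdier.
simple_xpath = "//a[@class='article-title']"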
Learning how to scrape a dynamic website will go much more smoothly if you keep the above advice in mind.
Scrape an Angular Website (Forbes.com) With Python
Now that all the basics and background info are out of the way, we’re going to scrape an Angular website, Forbes.com, using Helium. Angular is a popular, open-source framework for building web applications. This will show you how to scrape a dynamic website with the help of Helium.
Following the Helium documentation, we first install it in the command line (or Terminal on Mac) using
pip install helium
Note that it’s not necessary to install Selenium separately. It should be installed automatically as a package alongside Helium:
Successfully built helium
Installing collected packages: selenium, helium
Successfully installed helium-3.0.9 selenium-3.141.0
Then, we run
from helium import *
start_chrome('www.forbes.com')
or
start_firefox('www.forbes.com')
When you’re past the debugging phase of writing your scraper, you can pass headless=True as an additional argument to let the browser run in the background. For example:
start_firefox('www.forbes.com', headless=True)
At this stage, though, it can be helpful to watch it work. Note that you may need to install the appropriate chromedriver for your version of Chrome and add it to the same path as your script.
Next, say you want to scrape all of the breaking news headlines from the top of the page. If you inspect those links in your browser, you can see that they’re all in the CSS class “happening__title.” With this info, you can use the find_all() method along with S() to generate a list of those elements:
# Find all elements with class="happening__title"
breakingNewsList = find_all(S(".happening__title"))
Then, you can iterate through the list of elements to extract the text from each into a new list, and print the list of headings:
breakingNewsHeadings = [item.web_element.text for item in breakingNewsList]
for heading in breakingNewsHeadings:
print(heading)
Vanderbilt Hospital Turns Over Transgender Clinic Records To GOP Attorney General In Investigation
Did Hunter Biden Get Off Easy? Republicans Think So—Here’s What Legal Experts Say
Titanic Sub Search: Fact-Checking Claims About The Tourist Submersible That Went Missing
Russell Simmons Accused Of Abuse And ‘Intimidation’ By Daughter And Ex-Wife Kimora Lee Simmons: What We Know
To illustrate some additional functionality, if you notice that Helium keeps halting because of a certain pop-up that it doesn’t know how to handle, you can use wait_until() to pause the rest of your scraping until you can click through the pop-up. For example, say the web page wants you to accept its use of cookies, but the banner doesn’t appear until a second or two after the page loads:
wait_until(Button('Accept All').exists)
click(Button('Accept All'))
Then, if you want to go to a subpage of the website, you could use click() and hover() to get there through the navigation menu. Note that the “Open Navigation Menu” button title was found by inspecting the element in a browser with the Forbes site open.
click(Button("Open Navigation Menu"))
hover("Billionaires")
click("World's Billionaires")
On this page, you can use the find_all() method again to scrape all of a certain class of data from the table and then export it into your preferred file type.
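As a sketch of what that could look like (continuing the browser session from above; the class name and file name here are assumptions, not taken from the actual Forbes page):

import csv

# Hypothetical class name; inspect the table in your browser to find the real one.
cells = find_all(S(".table-cell"))

with open("billionaires.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for cell in cells:
        writer.writerow([cell.web_element.text])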
From here, if you wanted to search for all the authors writing about a certain topic (we chose “recession”), you could add the following:
click(Button("Search"))
write("recession")
click(Button("Submit"))
You could then use find_all() once again to gather all the elements of class “byline__author-name.”
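Continuing the same session, that could be as simple as:

# Collect the visible text of every byline element on the results page.
authors = [el.web_element.text for el in find_all(S(".byline__author-name"))]
print(set(authors))  # de-duplicate authors who appear more than once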
If you came across a page that used infinite scrolling, you could use scroll_down(num_pixels) to continuously scroll down. If you needed to upload files or right-click links, you could use drag_file() and rightclick().
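For infinite scrolling, one common pattern (a sketch, assuming the page height stops growing once no new content loads) is:

import time

previous_height = 0
while True:
    scroll_down(num_pixels=2000)
    time.sleep(2)  # give newly loaded content a moment to render
    height = get_driver().execute_script("return document.body.scrollHeight")
    if height == previous_height:
        break  # the page stopped growing, so we've reached the bottom
    previous_height = height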
And, of course, don’t forget to call kill_browser() at the end of your script so you don’t end up with too many browsers open during your testing phase! Your script ends automatically, but Helium doesn’t close the browsers it opened on its own.
Hopefully this tutorial gives you a solid idea of how to scrape a dynamic website. Remember that Python has an extensive user base who can help you if you get stuck.
Up Your Scraping Game With Rayobyte
There are several ways to learn how to scrape a dynamic website, a few of which we’ve outlined in this guide. Though you should avoid attempting this type of scraping with Beautiful Soup alone (it can’t render the JavaScript these pages rely on), you could successfully learn how to scrape a dynamic website using Selenium/Helium, Scrapy, or a full-service web scraper like Scraping Robot.
Our slightly more in-depth example highlights Helium as an easy-to-use tool for performing a variety of scraping, web page navigation, and browser automation tasks that could easily become overly complicated using other tools (such as Scrapy). Helium closely mirrors the way a real person would interact with a page (we find click(Button) to be especially enjoyable) and avoids many of the major challenges people new to scraping encounter when learning how to scrape a dynamic website.
This scraping example was rather simple, and scraping a more sophisticated website will likely require a bit more work. But now you have an idea of how to scrape a dynamic website and, hopefully, the motivation and inspiration to learn more. If you’re a more advanced programmer with more web scraping experience, maybe this guide reminded you that you could simplify some of your processes — you don’t always need to use the most complex tool for basic tasks.
If you’re interested in the data that web scraping can produce but are lacking the time and resources to either learn how to scrape a dynamic website or build a web-scraping team within your organization, Rayobyte can help. Check out Rayobyte and Scraping Robot to start learning how to scrape a dynamic website ASAP!