Playwright Web Scraping With Python (Complete Guide)
Web scraping with Playwright for Python means using the Playwright automation library to collect data from websites. Playwright is a cross-browser automation tool that can be used to scrape the web. In this article on Playwright web scraping, you’ll learn how to get started with Playwright, how to use it to scrape the web, and how to use proxies to make your web scraping automation more effective, efficient, and secure.
Playwright For Web Scraping
What is the Playwright tool? Playwright is an automation tool that has APIs that can be used with JavaScript, Java, .NET, and Python. It was initially released by Microsoft in 2020 as a cross-browser framework for end-to-end testing of web applications for bugs, functionality, and browser support. Playwright is a Node.js library, but you do not need to have Node.js installed on your computer before you can do Playwright web scraping with Python. You would need Node.js installed on your computer if you were using JavaScript.
Playwright simulates user events such as clicking on links or navigating through the web app by using buttons or menus. Its automation of simulated user events makes it a popular framework not only for testing websites but also for other automated activities such as web scraping.
The framework supports headless operation to make web scraping efficient by skipping the rendering of images and other web content. Headless tools work in the background, without a graphical user interface (GUI), and are controlled and tracked through the command line. Playwright has cross-browser support and can work with Chromium-based browsers such as Google Chrome and Microsoft Edge, with Firefox, and with WebKit, the browser engine behind Safari.
Here is a basic Playwright web scraping overview, which is provided in more detail in later sections:
- Figure out which sites you want to scrape and what goals you have for web scraping.
- Download and install a programming environment such as Python, complete with integrated development environment (IDE) software such as Visual Studio Code or PyCharm.
- Install Playwright using the pip command, then install the necessary binaries.
- Write a program that can automate the launching of a browser and the visiting of web pages to collect, or scrape, the desired data from the websites.
- Have the program parse the HTML or XML data into readable, usable data.
- Use a proxy server to hide your internet protocol (IP) address.
- Export the parsed data to save to your computer or a database for further analysis.
Playwright web scraping purposes
Scraping the web, including Playwright web scraping, can be useful in numerous ways.
- Individuals can scrape the web for large amounts of data for personal research or scholarly studies.
- Developers can use web scraping software to test their websites.
- Investors can analyze the stock market or the prices of currencies, real estate, bonds, commodities, or any other investment vehicle.
- Data analysts can scrape and compile data sets for machine learning.
- Consumers can gain information on price fluctuations to guide their purchasing behavior.
- Companies can achieve a competitive edge by staying up-to-date on technological or legal news.
- Companies can generate leads by scraping publicly listed email addresses and phone numbers of potentially interested customers and clients.
- Companies can conduct reconnaissance on competitor behavior.
- Organizations and companies can learn about public sentiment by scraping social media comments and forums.
These are just a few reasons why you may want to start web scraping with Playwright in Python or other libraries and languages. When you know what your goals for scraping the web are, you’ll need to research web scraping tools. Tools you will need are a computer with an internet connection, a programming environment such as Visual Studio Code, and whichever programming language and libraries you choose to get the scraping done. Playwright web scraping is just one option for scraping the web.
Puppeteer vs. Playwright for scraping
Puppeteer is another option for scraping the web. The Puppeteer framework, built by Google, is a headless browser testing tool that simulates user inputs such as clicking links, filling out forms, and using buttons. This is similar to using Playwright’s automation abilities. Like Playwright, Puppeteer is also a Node.js library, but it can also be used with Python.
Playwright web scraping is similar to Puppeteer web scraping. One difference is that Puppeteer only supports Chrome and Firefox but not Safari. Like Playwright, Puppeteer supports JavaScript. Puppeteer has been unofficially ported to Python through libraries such as Pyppeteer.
To use Puppeteer, you can install the package by running the command “npm install puppeteer”. Then, you can use Puppeteer’s API to automate web browser functions and simulate user actions such as interacting with web page elements. You can also use Puppeteer for web scraping.
There are several different libraries and languages to choose from for web scraping, including Playwright, Puppeteer, Selenium, and Beautiful Soup, to name a few. Whichever framework you use for web scraping, be sure to only scrape the web legally and ethically. That means respecting each website’s robots.txt file and terms of service, which may limit how you can use automated tools on the site or may forbid web scraping altogether. You should also scrape sites that are visible to the public and not behind paywalls or other security measures, such as password-protected sites or pages.
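One way to respect a site’s robots.txt file programmatically is Python’s built-in urllib.robotparser module. Here is a minimal sketch using a made-up policy; in practice you would fetch and parse the real file at the site’s /robots.txt path:

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt policy for illustration only
robots_txt = """User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether a generic crawler may fetch each path before scraping it
print(parser.can_fetch("*", "https://example.com/news"))             # True
print(parser.can_fetch("*", "https://example.com/private/reports"))  # False
```

Checking the policy before each scrape keeps your automation on the right side of a site’s stated rules, though robots.txt does not replace reading the site’s terms of service.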
Python Playwright Web Scraping Steps
For this article, it is assumed you will be using Python to scrape the web with Playwright. Your Playwright web scraping process may differ slightly, depending on which software you may already have downloaded and installed, but it will include most of the following steps.
1. Develop your Playwright web scraping plan
Before you start Playwright web scraping, develop your goals and your plan of action. You’ll need to know what you’re looking for so you know what kind of sites to target. Once you know what kind of sites to target, you’ll need to visit those sites and look at the source code of the site. Not all websites are built the same way.
If you are looking for news headlines, for example, the HTML code and the precise wording of the elements that include headlines will differ slightly between news sites. The same is true for other kinds of data.
2. Download and install Python and an IDE
Download the latest version of Python. Python 3 runs on Windows, macOS, and Linux. You may also want to use the bundled Integrated Development and Learning Environment (IDLE), especially if you don’t have a different IDE. You can run web-scraping scripts from IDLE as well as from other IDEs such as Visual Studio Code or PyCharm. If you use Visual Studio Code, install its Python extension for debugging support.
You will also be running commands in a command prompt or terminal. You’ll be using the package management system pip to install Playwright. Pip is installed automatically with recent versions of Python.
3. Install Playwright and Playwright browser binaries
Once you have Python installed, open your command prompt or terminal. Installing Playwright is as easy as running the pip command:
pip install playwright
After installing Playwright successfully, you should see terminal output confirming the installation. If you see errors such as the pip command not being recognized, pip may not be installed. Run the following command to check whether you have pip:
pip --version
If the error persists, you may need to bootstrap pip yourself, for example by running python -m ensurepip --upgrade or by downloading and running the official get-pip.py installer script.
After installing Playwright, run the playwright install command in the command prompt to download the browser binaries that let Playwright drive browsers such as Firefox and Chromium:
playwright install
After running that command, you should have everything you need to start Playwright web scraping.
4. Launch a web browser and navigate to a website
Once you have everything installed, including Playwright and Playwright binaries, you are ready to start using Playwright with Python. If you’ve completed Step 1, formulating a Playwright web scraping plan, you will likely already have an idea about the kinds of websites you want to scrape. For this guide, the site https://lite.cnn.com will be used to scrape news headlines.
For now, you can use the following code to simply print the entire contents of the HTML of the page to the terminal or the IDLE shell.
# Import Playwright
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Choose a browser, such as Chromium
    browser = p.chromium.launch()
    # Create a page object
    my_page = browser.new_page()
    # Change the URL to the site you want for your Playwright web scraping project
    my_page.goto("https://lite.cnn.com")
    # Create an object to contain the content of the page
    page_content = my_page.content()
    # For example purposes, simply print the page contents to the terminal or shell
    print(page_content)
    # Close the browser
    browser.close()
Since this will print the entire contents of the page and doesn’t make any attempt to parse the text, the output will be very long. If successful, you should see the HTML of the website https://lite.cnn.com in your terminal.
As a legal disclaimer, you should only use the content scraped from that website for personal use and should respect the terms of service and intellectual property of CNN. The use of https://lite.cnn.com in this guide is just for demonstration purposes. Also, use web scraping best practices, which include limiting the number of requests sent to a single site over a short period.
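One simple way to follow that practice is to enforce a minimum delay between page visits. Here is a minimal sketch; the polite_pause helper, the delay value, and the example URLs are illustrative, not part of Playwright:

```python
import time

MIN_DELAY_SECONDS = 1.0  # illustrative; choose a delay the target site can tolerate

def polite_pause(started_at, min_delay=MIN_DELAY_SECONDS):
    """Sleep just long enough that at least min_delay seconds pass
    between the start of one request and the start of the next."""
    elapsed = time.monotonic() - started_at
    if elapsed < min_delay:
        time.sleep(min_delay - elapsed)

# Hypothetical usage inside a scraping loop; the URLs are placeholders
for url in ["https://example.com/a", "https://example.com/b"]:
    started_at = time.monotonic()
    # ... the page visit (e.g., my_page.goto(url)) and parsing would go here ...
    polite_pause(started_at)
```

Measuring from the start of each request, rather than sleeping a fixed amount after it, keeps the request rate steady even when some pages take longer to load than others.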
5. Extract and save text elements from the page
If you’ve run the script above, then you’ll see that the entire contents of that particular website are printed to your terminal. This is not very useful in its current form. You’ll likely want some type of data from the website that you can save for further analysis. This data can be product prices, product descriptions, phone numbers, or any other type of data you want to scrape. In this example, the data to scrape are the headlines on a news site.
To be able to extract the data, you need to parse the website so it is readable and in a form that you can save and read at a later time. Openpyxl is a Python library that will let you save the data to a spreadsheet on your local machine.
The following example script will allow you to start scraping news headlines and create a spreadsheet. Your Playwright web scraping project will have a different URL and different elements on the page to scrape. For example, you may be scraping product prices that will have different HTML tags than the HTML tags in this example.
from playwright.sync_api import sync_playwright
import openpyxl

# Create an Excel workbook and a sheet title for the saved data
excel = openpyxl.Workbook()
sheet = excel.active
sheet.title = "Title"

with sync_playwright() as p:
    # Start a browser and create a page object. This example uses the Chromium web browser
    browser = p.chromium.launch()
    my_page = browser.new_page()
    # Navigate to the URL. In this Playwright web scraping example, a news headlines site is scraped.
    my_page.goto("https://lite.cnn.com")
    # Use Playwright's query selector to get all headlines. In this example, each headline
    # link sits inside an element with the class "card--lite" on the HTML page.
    headlines_elements = my_page.query_selector_all(".card--lite a")
    news_headlines = [element.text_content() for element in headlines_elements]
    for headline_text in news_headlines:
        # For each headline, add it to the spreadsheet
        sheet.append([headline_text])
    # Close the browser
    browser.close()

# Change the path to where you want the Excel spreadsheet file saved
excel.save('c:/path/to/excel/file/excel_file.xlsx')
# Print a message to the terminal to know when the Playwright web scraping is complete
print("Playwright web scraping finished")
After running the above script, you should see the message “Playwright web scraping finished” printed to the terminal and find a spreadsheet file in your selected directory containing the day’s headlines from https://lite.cnn.com.
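If you would rather not depend on openpyxl, Python’s standard csv module can save the same rows to a CSV file instead. Here is a minimal sketch, with made-up headlines standing in for the scraped list:

```python
import csv

# Illustrative data standing in for the scraped news_headlines list
news_headlines = ["Headline one", "Headline two", "Headline three"]

with open("headlines.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["headline"])  # header row
    for headline_text in news_headlines:
        writer.writerow([headline_text])

print("Saved", len(news_headlines), "headlines")
```

A CSV file opens in Excel and most other spreadsheet tools, so this is a lighter-weight option when you only need plain rows of text.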
Quotes scraping example
The particular site that you replace the CNN site with will depend on your scraping purposes. If you are using another site to scrape, you will also want to inspect the HTML and CSS of that site to look for elements of the page to scrape. As another example, you may want to scrape quotes from quotes.toscrape.com. This is a site that lets you sort quotations by tags and makes it easy for you to use web scraping methods and tools to collect quotes.
If you visit that site, you can right-click and choose “View Page Source” to see the entire page in HTML or “inspect” to see the specific parts of the HTML that are relevant. In this case, you would look for a quote and find the HTML tag to use in your script.
You can see that the quotes are in the <span> tag within a <div> tag that has the class “quote” nested within two other <div> tags. That tag will be particularly useful as you will use it to find each quote on the page to scrape. This is an example of why it is important to visit the sites you are going to scrape and inspect the HTML tags of the relevant data.
In this example, after selecting the tag “humor” and inspecting the site for HTML tags, you find that the quotes are nested in the following HTML elements:
<div class="row">
    <div class="col-md-8">
        <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
            <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>
            <span>by <small class="author" itemprop="author">Jane Austen</small>
                <a href="/author/Jane-Austen">(about)</a>
            </span>
As shown above, the quote is nested in elements with the classes “row,” “col-md-8,” “quote,” and “text.” Your Playwright selector will target these classes and scrape the content of each of the nested elements on the page that contain quotes.
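Playwright’s selector engine does this class matching for you, but to illustrate how a class such as “text” identifies the quote elements, here is a stdlib-only sketch that pulls the quote text out of a fragment like the one above. The QuoteExtractor class is illustrative and only mirrors the “.text” part of the selector, not the full nesting:

```python
from html.parser import HTMLParser

# A fragment like the one inspected on quotes.toscrape.com
html = '''
<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
  <span class="text" itemprop="text">"The person, be it gentleman or lady,
who has not pleasure in a good novel, must be intolerably stupid."</span>
  <span>by <small class="author" itemprop="author">Jane Austen</small></span>
</div>
'''

class QuoteExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_text_span = False
        self.quotes = []

    def handle_starttag(self, tag, attrs):
        # Mirror the ".text" class check of the ".quote .text" selector
        if tag == "span" and ("class", "text") in attrs:
            self.in_text_span = True
            self.quotes.append("")

    def handle_endtag(self, tag):
        if tag == "span" and self.in_text_span:
            self.in_text_span = False

    def handle_data(self, data):
        if self.in_text_span:
            self.quotes[-1] += data

parser = QuoteExtractor()
parser.feed(html)
print(parser.quotes)
```

This is only a teaching aid: real pages are messier, which is exactly why a selector engine like Playwright’s is worth using.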
The rest of the code is similar in that the script will scrape all of the contents on the page using the aforementioned selectors and save the content to an Excel file.
This is the script for this Playwright web scraping example:
from playwright.sync_api import sync_playwright
import openpyxl

# Create an Excel workbook and a sheet title for the saved data
excel = openpyxl.Workbook()
sheet = excel.active
sheet.title = "Quotes"

with sync_playwright() as p:
    browser = p.chromium.launch()
    my_page = browser.new_page()
    # Navigate to the new URL where quotes are present
    my_page.goto("https://quotes.toscrape.com/tag/humor")
    # Use Playwright's query selector to get all quotes and authors
    quotes_elements = my_page.query_selector_all(".quote .text")
    author_elements = my_page.query_selector_all(".quote .author")
    # Extract the text content from each quote and author
    quotes = [element.text_content() for element in quotes_elements]
    authors = [element.text_content() for element in author_elements]
    # Add each quote and its author to the spreadsheet
    for quote_text, author_name in zip(quotes, authors):
        sheet.append([quote_text, author_name])
    # Close the browser
    browser.close()

# Change the path to where you want the Excel spreadsheet file saved
excel.save('c:/path/to/your/directory/quotes_file.xlsx')
# Optionally print a message to the terminal to know when the scraping is complete
print("Quote scraping finished")
The above examples are two very specific uses of Playwright web scraping. To successfully carry out Playwright web scraping, you need to inspect the HTML for the pages you want to scrape and find the HTML tags that the Playwright selectors will find to scrape. After doing this, you can use a similar script to the examples above to plug in the relevant HTML tags for your target website.
Playwright Web Scraping and Proxies
When scraping the web for vast amounts of data, you will be using automation to visit websites. Some of the websites your automated program will visit may have mechanisms in place to prevent bot activity and will block your internet protocol (IP) address if they suspect you are using automation to visit their site for any purpose, legitimate or not. Even though you are using automation ethically and are not trying to do anything malicious, you can still get blocked.
To get around getting blocked from these sites, you can use a proxy. A proxy essentially hides your real IP address by using a different IP address to visit the sites for you. These IP proxies can be purchased in bulk and cycled so that there is less of a risk that even your proxy gets blocked. Proxies can be purchased from sites such as Rayobyte that offer an array of proxy types and blocks of IP addresses, including residential proxies, which are IP addresses that look like regular residential web traffic.
Playwright proxy use case
Before you can use your proxy, you’ll need to know the IP address and port. Python allows you to include the IP address and port of your proxy directly in your script. This is accomplished by including a proxy parameter in the browser launch function. This parameter will be a dictionary containing the server, username, and password for the proxy.
This is the script with the proxy information included:
from playwright.sync_api import sync_playwright
import openpyxl

excel = openpyxl.Workbook()
sheet = excel.active
sheet.title = "Title"

with sync_playwright() as p:
    ''' Inside the browser launch call, pass a proxy parameter in the form of a
    dictionary that contains the server as well as the username and password if
    your proxy requires them. Change the server address, username, and password
    in the code below. '''
    browser = p.chromium.launch(proxy={"server": "http://the-proxy.com:####",
                                       "username": "myUsername",
                                       "password": "myPassword"})
    my_page = browser.new_page()
    my_page.goto("https://lite.cnn.com")
    headlines_elements = my_page.query_selector_all(".card--lite a")
    news_headlines = [element.text_content() for element in headlines_elements]
    for headline_text in news_headlines:
        sheet.append([headline_text])
    browser.close()

excel.save('c:/path/to/excel/file/excel_file.xlsx')
# Print a message to the terminal to know when the Playwright web scraping is complete
print("Playwright web scraping finished")
You can also use a JSON file to read the details of your server login credentials if you are frequently writing them into your scripts. To do this, make a JSON file as follows and call it “login.json” or a similar file name:
{
    "server": "http://the-proxy.com:####",
    "username": "yourUsername",
    "password": "yourPassword"
}
To read the JSON file, add an import at the top of the script to import the JSON library. Then, use a basic “open” function to open the file, read the contents, and then use the contents of the file, which are the username and password in this example.
# Import the JSON library
import json

# Load your login credentials from the JSON file
with open('login.json', 'r') as file:
    credentials = json.load(file)
username = credentials['username']
password = credentials['password']
From there, you can use the username and password elsewhere in your scripts. Alternatively, you can place the above code directly before the browser launch in your Playwright web scraping code.
# Put the imports and Excel setup code here

with open('login.json', 'r') as file:
    proxy_config = json.load(file)

with sync_playwright() as p:
    browser = p.chromium.launch(proxy=proxy_config)
    my_page = browser.new_page()
    # The rest of the Playwright scraping script is the same.
Which way you write the code is a matter of preference; the results are the same. You should be able to use your proxy, provided that you entered the correct server and login credentials.
Final Thoughts on Playwright Web Scraping
Playwright is a powerful tool that can be used to test websites, automate processes, and scrape the web. Playwright web scraping can be an easy and effective way to gather large amounts of data from web pages. As you can see from this guide, Playwright web scraping is not too difficult and can be done with only a few steps, a bit of Python code, and only a few pieces of free software. Proxies can make Python Playwright web scraping even easier, more private, and more secure.
For more information on Playwright web scraping, proxies, and other related topics, visit Rayobyte’s vast knowledge base, where you can find informative articles. Also, let us know if you need proxies or even a web scraping robot that can do the scraping for you. Contact us today.
The information contained within this article, including information posted by official staff, guest-submitted material, message board postings, or other third-party material is presented solely for the purposes of education and furtherance of the knowledge of the reader. All trademarks used in this publication are hereby acknowledged as the property of their respective owners.