Cheerio Web Scraping: How This Node.js Parsing Tool Stacks Up Against Puppeteer
Web scraping has become the de facto way of gathering data from websites, from tracking prices to crawling images. Node.js, the JavaScript runtime environment, is popular for web scraping due to its scalability and performance. It also offers several web scraping tools, such as Cheerio and Puppeteer. Determining which is best for your situation can seem difficult since these tools appear similar at first glance. This article will explore Node.js web scraping with Cheerio, a lightweight and fast Node.js parsing library. You’ll also learn how Cheerio web scraping works by building a simple web scraper from scratch and how it compares to Puppeteer.
What Is Cheerio Web Scraping?
Cheerio is a Node.js library that parses HTML documents. It is a quick and lightweight alternative to jQuery that allows you to quickly select elements and parse web pages without additional dependencies. The Cheerio module is designed to be faster, smaller, and more flexible than jQuery, but it still uses a syntax that is familiar to most web developers.
Cheerio is not a web browser and does not execute JavaScript, making it ideal for scraping static pages. It also doesn’t require clicks, form submissions, or event handling. This means you don’t have to wait for pages to load before scraping them and can quickly extract the desired information.
Node.js web scraping with Cheerio is great for beginners and experts who want to scrape static web pages quickly. It is also much easier to use than Puppeteer, which requires more complex coding.
How Node.js Web Scraping With Cheerio Stands Out From Puppeteer
While both Puppeteer and Cheerio are built on Node.js, they have different purposes and strengths. When it comes to Cheerio vs. Puppeteer, Cheerio is:
- Faster and requires less setup than Puppeteer: Unlike Puppeteer, Cheerio does not require a browser to run and is quick to set up. It is much faster than Puppeteer, as it does not need to wait for pages to load. Puppeteer also has a more complex setup and requires more coding.
- Easier and more lightweight than Puppeteer: Cheerio is designed as a simplified version of jQuery. It has a nearly identical API to jQuery and uses the same CSS selectors for selecting elements. It also requires minimal coding and is much smaller than Puppeteer. This means your projects will run faster.
- Great for extracting specific data from web pages: Cheerio removes all the DOM inconsistencies and browser-specific implementation of jQuery, so it is perfect for quickly scraping static web pages and extracting the desired information. It can be used to track prices, crawl images, and more.
Puppeteer is more complex than Cheerio but has more features, like taking screenshots and generating PDFs. It is also great for automating web browser tasks. And, of course, it can handle dynamic pages.
After you experiment with Cheerio, you may find that Puppeteer is better for your web scraping needs.
Building a Web Scraper With Cheerio
Before building a web scraper with Cheerio, you must have a solid understanding of JavaScript and Node.js. Cheerio is built on the Node.js platform and requires JavaScript for its syntax. You must also ensure that the following are installed on your computer before proceeding:
- Node.js
- npm (Node Package Manager)
- Code editor (e.g., Visual Studio Code)
You can check if these are installed on your computer by running the node -v and npm -v commands in the terminal. While a code editor isn’t necessary to use Cheerio, it can highlight errors and provide auto-completion. Visual Studio Code is a free and open-source code editor that is lightweight and easy to use. Once you have verified that Node.js and npm are properly installed, you can start building your web scraper with Cheerio.
Step 1: Create a folder for the web scraper
This will be the workspace for your Cheerio web scraper. Create a new folder and name it “CheerioWebScraper” or any other name you’d like.
Step 2: Initialize the project
To use Cheerio, you must initialize the Node.js project inside your folder. To do this, open a terminal window and navigate to the folder. Then, run this command to create a package.json file:
npm init -y
This is necessary for managing dependencies and setting up the project.
Step 3: Install Cheerio
Next, you must install Cheerio in your project. To do this, run the following command in the terminal:
npm install cheerio
This will add cheerio to your project’s dependencies.
Step 4: Install Axios
Axios is a Node.js module for making HTTP requests. It will allow you to set timeouts, header parameters, and more to help your web scraper make more efficient requests. To install Axios, run the following command in the terminal:
npm install axios
Step 5: Write the web scraper
Now that you’ve installed Cheerio and Axios, you can build a simple web scraper that extracts the title of a webpage and prints it to the console. Create a new file in your folder and name it “scraper.js.” Then, open the file in your code editor and add the following code:
const cheerio = require('cheerio');
const axios = require('axios');

// URL of the page you want to scrape
const url = 'https://example.com';

// Make the HTTP request with Axios
axios.get(url)
  .then(response => {
    // Load the HTML into Cheerio
    const $ = cheerio.load(response.data);
    // Select the title element
    const pageTitle = $('title').text();
    // Print the title to the console
    console.log(pageTitle);
  })
  .catch(console.error);
Step 6: Run the web scraper
Once you have written your web scraper, you can test it by running the following command in the terminal:
node scraper.js
You should see the page title printed on the console if everything is set up correctly.
Web Scraping With Cheerio
Now that you have built a basic web scraper with Cheerio, you can explore and scrape real-world websites. For inspiration, you can find several examples of Cheerio web scraping on GitHub.
We’ll use toscrape.com, a free web scraping practice site. It contains two pages — a fictional bookstore and a page with quotes from famous people. You’ll be scraping the bookstore page and extracting science fiction books by authors with the last name Wells.
The first three steps of this tutorial are necessary only if you are scraping a real website.
Step 1: Check if web scraping is allowed on the website
Websites have a robots.txt file that informs web scrapers if they can scrape the site. To check if web scraping is allowed, look at the robots.txt file of the website you want to scrape. If you find a line that says Disallow: /, it means web scraping is prohibited on the website.
Step 2: Make sure the page is static
Scraping JavaScript-rendered web pages with Cheerio requires additional dependencies, so it's best to stick to static pages. The page's file extension is a rough first clue: pages ending in .html or .htm are usually static, while .php, .asp, or .jsp pages are generated server-side and may rely on JavaScript. A more reliable check is to view the page source in your browser — if the data you want appears in the raw HTML, Cheerio can scrape it; if the source is mostly empty containers that scripts fill in later, the page is JavaScript-rendered and may require additional tools for scraping.
Step 3: Inspect the page’s HTML structure
Once you have confirmed that the page is static, open it in a web browser and inspect its HTML structure. Right-click anywhere on the page and select “Inspect” or “Inspect Element.” This will open up the page’s HTML structure and allow you to identify the element that contains the data you want to scrape. It’s important to note that HTML structure can change, so you may need to adjust your web scraper if the page updates.
Step 4: Write the web scraper
Now that you understand the HTML structure of the page, you can start writing your web scraper with Cheerio. Using the code you wrote in the previous section as a template, you’ll end up with something like this:
const cheerio = require('cheerio');
const axios = require('axios');

// URL of the page you want to scrape
const url = 'https://books.toscrape.com/';

// Make the HTTP request with Axios
axios.get(url)
  .then(response => {
    // Load the HTML into Cheerio
    const $ = cheerio.load(response.data);
    // Select the books written by authors with the last name 'Wells'
    const wellsBooks = $('.product_pod h3 a')
      .filter((i, el) => $(el).text().includes('Wells'))
      .map((i, el) => $(el).text())
      .get();
    // Print the books to the console
    console.log(wellsBooks);
  })
  .catch(console.error);
Let’s break down the code above:
- The cheerio.load() function loads the HTML document into cheerio so it can be manipulated.
- The $('.product_pod h3 a') selector targets the link (a) elements inside h3 tags within elements with the class "product_pod." On this website, these are the titles of all the books listed on the page.
- The filter() function filters out all the books that don’t have “Wells” in the title.
- The map() function extracts the title of each book and returns it in an array.
- The get() function retrieves the array from cheerio.
- The console.log() prints the array of books to the console.
- The catch() function handles any errors that might occur during the web scraping process.
Step 5: Process the data
The web scraper should now be able to extract any science fiction books written by authors with the last name “Wells.” To process the extracted data, you can use a Node.js module such as cheerio-tableparser to convert HTML tables into arrays that you can then save as CSV or JSON. To do this, you must first install cheerio-tableparser with the following command:
npm install cheerio-tableparser
Then, use the code below to process the data:
const cheerioTableparser = require('cheerio-tableparser');

// Register the parsetable() plugin on your Cheerio instance
cheerioTableparser($);

// parsetable() returns the table as an array of columns,
// each column being an array of cell values
const tableData = $('table').parsetable(true, true, true);
console.log(tableData);
If the page presents its data in an HTML table, this will return the extracted cell values, which you can then reshape into the format you need.
Scraping JavaScript-rendered Web Pages
As mentioned earlier, scraping JavaScript-rendered web pages with Cheerio isn’t easy and requires additional dependencies. Puppeteer is a Node.js library that can scrape these types of pages, but it is more complex and requires more coding.
Puppeteer is best suited for automating web browser tasks like clicking links and filling out forms. To use Puppeteer instead of scraping JavaScript-rendered web pages with Cheerio, you must write Puppeteer scripts that contain the commands and instructions you want Puppeteer to perform. For example, your Puppeteer script might look like this:
const puppeteer = require('puppeteer');

(async () => {
  // Launch Puppeteer with headless mode set to false so you can see what it is doing
  const browser = await puppeteer.launch({ headless: false });
  // Create a new page
  const page = await browser.newPage();
  // Go to the page you want to scrape
  await page.goto('https://example.com');
  // Wait for the page to load and for JavaScript to execute
  await page.waitForFunction('document.querySelector("div")');
  // Extract the data from the page
  const data = await page.evaluate(() => {
    // Select the element you want to scrape
    const div = document.querySelector('div');
    return div.innerText;
  });
  // Print the data to the console
  console.log(data);
  // Close the browser
  await browser.close();
})();
The Puppeteer script above will launch Puppeteer in non-headless mode, navigate to the page, wait for the page to load and JavaScript to execute, extract the data from a div on the page, and print it to the console. Here’s a more detailed explanation:
- The puppeteer.launch() function launches Puppeteer and sets the headless mode to false so you can see what Puppeteer is doing.
- The page.goto() function navigates Puppeteer to the page you want to scrape.
- The page.waitForFunction() function waits for the page to load and for JavaScript to execute before Puppeteer can scrape the data.
- The page.evaluate() function extracts the data from the page and returns it in a variable.
- The console.log() prints the data to the console.
- The browser.close() function closes Puppeteer.
Many free Puppeteer scripts are available on GitHub to help you get started with Puppeteer.
Common Cheerio Web Scraping Challenges
When working with Cheerio, it is important to prepare for common web scraping challenges, including:
- HTML structure changes: As previously mentioned, the HTML structure of a webpage can change. You must continuously monitor the web page to ensure your Cheerio web scraper keeps working if changes are made.
- Slow response times: If the web page takes too long to load, Cheerio may be unable to scrape it correctly. You can try setting a higher timeout value in Cheerio or using Puppeteer instead.
- Javascript-rendered pages: Cheerio is not designed to scrape JavaScript-rendered web pages. You must use additional tools like Puppeteer for this.
- Poorly formatted HTML: If the web page’s HTML is poorly formatted or contains errors, Cheerio may fail to scrape it correctly. Cheerio’s underlying parser is forgiving of malformed markup, but badly broken pages may still need manual cleanup before the data can be extracted reliably.
- IP blocking: Some websites may block your IP address if you make too many requests in a short period. Make sure to throttle your Cheerio web scraper and space out the requests to prevent this from happening.
Cheerio Web Scraping Best Practices
Cheerio web scraping can save you a lot of time when done right. Here are some best practices to follow when using Cheerio:
- Write a script to monitor the web page for changes. This will help you keep your Cheerio web scraper up to date.
- Leverage the developer community. Many open-source Cheerio web scraper scripts are available, so take advantage of them.
- Space your requests. If you’re sending multiple requests to the same site, space them out to avoid getting blocked.
- Be aware of the legal implications of web scraping. Ensure you understand the terms and conditions of the websites you’re scraping from and follow them.
- Scrape ethically. Respect the terms and conditions of the website, don’t load too much data at once, and ensure you aren’t adversely affecting the website’s performance.
- Use different crawling patterns. Don’t scrape the same website with the same pattern repeatedly. Mix up your requests to avoid being detected and blocked.
Using Proxies When Web Scraping With Cheerio
Web scraping with Puppeteer and Cheerio is an excellent way to extract data from websites, but it has risks. Websites can detect web scrapers by looking for bots sending many requests from the same IP address. If a website detects too many requests from the same IP address, a few things can happen:
- The website might rate-limit your requests, meaning it will slow down the speed of your scrapers or even block them.
- The website might flag your IP address as a malicious entity, which will block your scrapers permanently.
- You risk being identified and reported as a web scraper, which can get you into legal trouble.
To lessen the chance of running into these issues, use proxies. Proxies mask your IP address and make it look like the requests come from different IP addresses.
Choosing the best proxy for Cheerio web scraping depends on your scraping needs and goals. It may be tempting to use free proxies, but these are unreliable and often too slow for web scraping. Paid proxies are your best bet for Cheerio web scraping, but they can be expensive, so it’s important to do your research.
Here are the best proxies for Cheerio web scraping:
Residential proxies
These proxies use real IP addresses from residential networks and provide the highest success rates for web scraping. This is because websites are less likely to rate-limit or block residential IP addresses.
Rayobyte’s residential proxies are some of the best on the market. They offer high-speed proxies with unlimited bandwidth and fast connection speeds, which makes them perfect for web scraping.
Rotating ISP proxies
These proxies change the IP address after each request and are great for scraping large amounts of data. They come with a higher price tag but provide better anonymity for web scrapers.
Rayobyte’s rotating ISP proxies are fast and reliable and offer features like sticky sessions, real IP ASNs, and auto IP rotation for improved anonymity.
Data center proxies
Data center proxies are another excellent choice for accessing blocked websites. They use a network of IP addresses from data centers around the world to unblock websites, making them more reliable for Cheerio web scraping.
Rayobyte’s data center proxies provide the speed and reliability you need to access blocked websites. Our lightning-fast proxies offer unlimited bandwidth and over 25 petabytes of monthly data. With our data center proxies, you can access websites from anywhere in the world without worrying about being detected or tracked.
Rayobyte and Cheerio Web Scraping
While both Cheerio and Puppeteer have great web scraping capabilities, Cheerio web scraping is easier to set up and maintain. It’s a great way to extract data from websites without the need for interactions such as clicks or form submissions. It can save you a lot of time, but it is important to follow the best practices listed above and use proxies to protect yourself from rate-limiting and blocking.
Combining Rayobyte’s proxies and Cheerio will ensure your web scraping activities are secure and successful. Our state-of-the-art rotating ISP, residential, and data center proxies help you extract valuable website data from any area quickly and safely. Our 24/7 customer service is always ready to answer your queries and ensure you have the support you need when it matters most.
With Rayobyte, you can scrape ethically, with confidence, and without worry. Ready to get started? Start your trial today!
The information contained within this article, including information posted by official staff, guest-submitted material, message board postings, or other third-party material is presented solely for the purposes of education and furtherance of the knowledge of the reader. All trademarks used in this publication are hereby acknowledged as the property of their respective owners.