Web Scraping with Crawlee

For many of today’s business owners, data is everything. The more data you have, the better your decisions can be and the less risk you take when making them. Web scraping and browser automation are two ways to gather that data. They can help you stay competitive in today’s digital market by giving you up-to-date information about what competitors are doing and what the market has to offer. Consider the benefits of web scraping with Crawlee to see what we mean.


What Is Crawlee?


Crawlee is a Node.js package. It provides a straightforward interface that can be used for web scraping and browser automation, and it is quite adaptable as well. It lets you retrieve web pages and apply CSS selectors to pull data from them. It can also help you navigate the DOM tree, allowing you to follow links and scrape numerous pages in a single run.

There are numerous benefits to Crawlee as a Node.js web crawler. Because it is versatile and provides a uniform interface for crawling with both HTTP requests and headless browsers, it is suitable for most needs. Also note that it has an integrated persistent queue that handles URLs in either breadth-first or depth-first order.
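To illustrate the queue ordering, here is a minimal sketch. It assumes the forefront option on Crawlee’s addRequests() call, which pushes new requests to the front of the queue (depth-first style) instead of the back (the breadth-first default); verify the exact option against the current Crawlee docs.

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, crawler }) => {
        console.log(`Visiting ${request.url}`);
        // Default: new requests go to the back of the queue (breadth-first).
        // With { forefront: true }, they go to the front (depth-first order).
        // The URL below is a placeholder for links you would extract from the page.
        await crawler.addRequests(
            [{ url: 'https://crawlee.dev/docs' }],
            { forefront: true },
        );
    },
    maxRequestsPerCrawl: 10, // Safety cap for this sketch.
});

await crawler.run(['https://crawlee.dev/']);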

Another benefit of using Crawlee is that it offers integrated proxy rotation as well as session management. You can also take advantage of its pluggable storage solutions for files and tabular data. All of these features help make Crawlee a highly desirable resource for web scraping, but there’s more. Crawlee also provides hook-based lifecycle customization, error handling, programmable routing, and retries.

If all of your boxes are getting checked off as you consider web scraping with Crawlee, there’s even more good news. The setup is rather simple and fast, as a CLI is available. You also get generated Dockerfiles, which can help speed up deployment of your applications.

The fact is that Crawlee is a good option for a variety of web scraping applications. It offers a robust set of features to help you handle your specific project needs with confidence. Written in TypeScript with generics for type-safe crawling and scraping, Crawlee offers key benefits that are hard to overlook.

Key Advantages of Using Crawlee for Web Scraping


There are various benefits to using Crawlee for web scraping as well as browser automation. Though every situation is a bit different – and there are dozens of other web scraping solutions out there to consider – there are plenty of reasons to choose Crawlee. Consider these.

It has a single interface. Crawlee provides one interface for HTTP crawling as well as headless browser crawling. For those who need both, this really streamlines the process: it eliminates the need to switch between separate tools based on the project you are working on right now.
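To see what that uniform interface looks like in practice, here is a minimal sketch: the same crawl expressed with the HTTP-based CheerioCrawler and with the browser-based PlaywrightCrawler. Only the crawler class and the way page content is accessed change.

import { CheerioCrawler, PlaywrightCrawler } from 'crawlee';

// HTTP crawling: pages are fetched with plain requests and parsed with Cheerio.
const httpCrawler = new CheerioCrawler({
    requestHandler: async ({ request, $ }) => {
        console.log(request.url, $('title').text());
    },
});

// Browser crawling: the same shape, but a real (headless) browser loads the page.
const browserCrawler = new PlaywrightCrawler({
    requestHandler: async ({ request, page }) => {
        console.log(request.url, await page.title());
    },
});

await httpCrawler.run(['https://crawlee.dev/']);
await browserCrawler.run(['https://crawlee.dev/']);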

You gain pluggable storage. Another nice feature is that it supports pluggable storage methods for both tabular data and files. That makes storing and managing the data you extract simple to do in whatever way fits your objectives and goals.
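As a quick illustration of that built-in storage, the sketch below saves a tabular record to Crawlee’s default Dataset and a file-like value to its default KeyValueStore; the field names here are just illustrative.

import { Dataset, KeyValueStore } from 'crawlee';

// Tabular data: each call appends one record to the default dataset
// (stored under ./storage/datasets/default by default).
await Dataset.pushData({ title: 'Example Book', price: '£10.00' });

// Files and other values: stored under a key in the key-value store.
await KeyValueStore.setValue('run-notes', { startedAt: new Date().toISOString() });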

It utilizes customizable lifecycles. Another key benefit is that Crawlee provides you with a way to alter your crawlers’ lifecycle using hooks. Hooks are versatile and can be used to carry out various operations both before and after specific events, again based on your objectives. One common example is setting up a hook that runs before each request is made or after data is collected.
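For instance, the browser-based crawlers accept preNavigationHooks, functions that run before each page is loaded. A minimal sketch (blocking image downloads is just an example use):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Hooks run before each navigation; useful for headers, auth, or blocking assets.
    preNavigationHooks: [
        async ({ page, request }) => {
            console.log(`About to load ${request.url}`);
            // Example: skip image downloads to speed up the crawl.
            await page.route('**/*.{png,jpg,jpeg}', (route) => route.abort());
        },
    ],
    requestHandler: async ({ page }) => {
        console.log(await page.title());
    },
});

await crawler.run(['https://crawlee.dev/']);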

Proxy rotation is an option. Another benefit of using Crawlee is that it allows for both proxy rotation and session management. This built-in support benefits most web scraping projects, thanks in part to the complexity of the anti-bot software that is out there. With these features, Crawlee can handle more complex web interactions and avoid blocking based on IP address, a very big limitation to web scraping success.
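Here is a short sketch of what that support looks like; the proxy URLs are placeholders for your own endpoints.

import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    // Placeholder proxies; Crawlee rotates through the list automatically.
    proxyUrls: [
        'http://proxy-1.example.com:8000',
        'http://proxy-2.example.com:8000',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    useSessionPool: true,           // Manage a pool of sessions for the crawl.
    persistCookiesPerSession: true, // Keep cookies tied to each session.
    requestHandler: async ({ request, proxyInfo }) => {
        console.log(request.url, 'via', proxyInfo?.url);
    },
});

await crawler.run(['https://crawlee.dev/']);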

Configuration options are available. Also note that with Crawlee, you can configure request routing, error handling, and retries effectively. You can route requests to different handlers based on your objectives, and the crawler will deal with errors and retry failed requests as many times as you need it to. In situations where you are dealing with edge cases or an unexplained issue, this makes the process easier to manage overall.
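As a sketch of those options, the example below routes requests to different handlers with Crawlee’s router and caps retries; the DETAIL label and the a.detail selector are illustrative.

import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// Requests without a label land in the default handler.
router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({ selector: 'a.detail', label: 'DETAIL' });
});

// Requests labeled DETAIL are routed here.
router.addHandler('DETAIL', async ({ request, $ }) => {
    console.log(request.url, $('h1').text());
});

const crawler = new CheerioCrawler({
    requestHandler: router,
    maxRequestRetries: 5, // Retry each failing request up to five times.
    failedRequestHandler: async ({ request }) => {
        console.error(`Gave up on ${request.url}`);
    },
});

await crawler.run(['https://crawlee.dev/']);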

It is Docker capable. Keep in mind that Crawlee projects also come with Dockerfiles. That means, as a developer, you can rapidly and easily deploy your crawlers to production environments.

With Crawlee offering so many benefits, you may be ready to give it a try. The good news is that it can be set up rather easily and quickly, allowing you to move your project forward faster. You do not need a lot of experience to use Crawlee, but there are several steps that can make setting it up and getting started a bit easier to manage.

How to Set Up Crawlee Web Scraping


In this Crawlee web scraping tutorial, we will break down the process of setting up Crawlee and provide insight into how it works. Keep in mind that you do need some programming skills and knowledge to adapt this process to your own needs.

First, to install Crawlee on your system, you will need Node.js version 16.0 or higher, along with npm. Once you have those, use the Crawlee CLI. This is the best method for getting set up and starting to use the tool, and it allows you to scaffold new projects rather quickly.

To install it, run the following command, which creates the project inside a new “my-crawler” directory:

npx crawlee create my-crawler

Once you do that, the npx CLI tool runs the Crawlee package locally, meaning it does not install it globally on your device. When you run the command listed above, you will get a prompt to choose a template. Choose the one applicable to your project. For this example, we will focus on “Getting Started Example (JavaScript),” one of the options listed, because we are doing Node.js development.

This option will install all of the required dependencies and create a new directory called my-crawler on your device (within the current working directory).

This also adds the package.json to the folder.

You will also have example source code provided to you, which you can use immediately and adapt to your needs. After installation, you should see a message on your screen that says:

“Project my-crawler was created. To run it, run ‘cd my-crawler’ and ‘npm start’.”

The most important thing to remember here is that the Crawlee project was built in the my-crawler folder. You need to change your current directory to that folder. To do that, use this command:

cd my-crawler

Then, run the next command to actually get your Crawlee project started:

npm start

Doing this runs the project’s start script, which starts the default crawler. This particular crawler will crawl the Crawlee website and output the titles of all the links on that website.

Once you have reached this step, you have officially installed Crawlee. That was easy. The next step is to start using it.

How to Use Crawlee


Before diving into the next insight, you need to have a better idea of how Crawlee works. There are three types of Crawlee crawlers:

  • CheerioCrawler
  • PuppeteerCrawler
  • PlaywrightCrawler

Each one of them is designed to act in a similar way. They will visit a website and carry out the specific tasks you need them to. They then save the results, move on to the next page, and continue this process until the job you created for them is done. It is pretty straightforward.

The crawler has to be able to answer two key questions:

  • Where should I go?
  • What should I do there?

Once you provide answers to these questions, you can put the crawler to work. Most of the other settings come pre-configured for the crawlers.
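In code, those two answers map to the list of start URLs (where to go) and the requestHandler (what to do there). A minimal sketch:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // "What should I do there?" -- extract and print the page title.
    requestHandler: async ({ request, $ }) => {
        console.log(`${request.url}: ${$('title').text()}`);
    },
});

// "Where should I go?" -- the list of start URLs.
await crawler.run(['https://crawlee.dev/']);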

To give you a better idea of how this can work, let’s consider an example. Let’s say that you want to scrape a demo website (a real one can easily be substituted into this code). You want to pull up a list of books so you can monitor a competitor’s inventory levels. To do that, you want to build a web scraping tool that pulls all of the titles of the books that the competitor has listed on its website.

This web scraping demonstration uses books.toscrape.com.

To get started, open the main.js file, which you will find in the src folder of your project. Then overwrite the code there by placing the following code in its place:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        // Wait for the book titles to load.
        await page.waitForSelector('h3');

        // Execute a function in the browser that targets
        // the book title elements and allows their manipulation.
        const bookTitles = await page.$$eval('h3', (els) => {
            // Extract the text content from the titles list.
            return els.map((el) => el.textContent);
        });

        bookTitles.forEach((text, i) => {
            console.log(`Book_${i + 1}: ${text}\n`);
        });
    },
});

await crawler.run(['https://books.toscrape.com/']);

This uses the PlaywrightCrawler configuration. Now, let’s break down what this code is telling the system to accomplish.

First, the code imports the PlaywrightCrawler class from the Crawlee package. It then creates a new crawler of the PlaywrightCrawler type. Instantiating this class requires an options object, which includes the requestHandler function: an asynchronous function that is executed on each page the crawler visits during this project.

The requestHandler function waits for the page’s <h3> elements to load. From your previous browsing of the site, you know that these hold the titles of the books. This is done using the page.waitForSelector() call.

Next, you see the page.$$eval() method. This is executed in the browser’s context and extracts the text content from all of the <h3> elements on the page being visited.

Finally, the crawler.run() method starts the crawler with the list of URLs to visit, and the requestHandler logs each title to the console.

Then, to run the project, just type in:

npm start

How to Use Crawlee with Headless and HTTP Browsers


You may remember that one of the core benefits of using Crawlee is that you can easily switch between headless browser crawling and HTTP crawling to handle the tasks you need to. This is a nice benefit, but you still need to learn how to use it.

To use headless browsers in Crawlee, there are a couple of steps to follow. Crawlee supports headless control of various browsers, including Firefox and Chromium, through both PuppeteerCrawler and PlaywrightCrawler. This allows you to perform true browser crawling and extract the data you want, even from more complex websites.

To make this work for your needs, you will need to specify the browser type, the launch options, and the context options. These settings let you use headless browsers with Crawlee.

Let’s continue with the same project as above, with the objective of pulling book titles from the competitor’s website. Say you want to run the crawl specifically in headless Firefox. You can replace the contents of main.js with the following code to make that happen:

import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    launchContext: {
        // Set Firefox as the browser to be used by the crawler.
        launcher: firefox,
    },
    requestHandler: async ({ page }) => {
        // Wait for the book titles to render.
        await page.waitForSelector('h3');

        // Execute a function in the browser that targets
        // the book title elements and allows their manipulation.
        const bookTitles = await page.$$eval('h3', (els) => {
            // Extract the text content from the book title elements.
            return els.map((el) => el.textContent);
        });

        bookTitles.forEach((text, i) => {
            console.log(`Book_${i + 1}: ${text}\n`);
        });
    },
});

await crawler.run(['https://books.toscrape.com/']);

That’s pretty straightforward. But remember, you have other options to consider as well. Try this code first so you can see how well the method works for your project.

Remember, though, that with Crawlee, you can also use its HTTP crawling features. If you are considering that, check out some of the benefits of HTTP crawling:

HTTP crawling allows for automatic configuration of browser-like headers. It also allows for replication of browser TLS fingerprints, which can be critical to your project. You can scrape JSON APIs, and you get zero-config HTTP/2 support, even when you are using proxies. It also offers integrated fast HTML parsers.
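Here is a minimal HTTP crawling sketch using CheerioCrawler, which fetches pages over plain HTTP and parses them with a fast HTML parser; the h3 a selector is specific to books.toscrape.com.

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // No browser involved: the page is fetched and parsed directly.
    requestHandler: async ({ $ }) => {
        $('h3 a').each((i, el) => {
            console.log(`Book_${i + 1}: ${$(el).attr('title')}`);
        });
    },
});

await crawler.run(['https://books.toscrape.com/']);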

Another option is to use Crawlee’s real browser crawling feature, which is versatile enough for most needs as well. With real browser crawling, you can capture JavaScript-rendered content and take screenshots. It provides both headless and headful support, automatic browser management, and zero-configuration generation of human-like fingerprints. Also note that you can use Playwright and Puppeteer within the same interface with this method, which may be beneficial for some applications, and it works with most browsers, including Firefox, WebKit, and Chrome.
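For example, switching between headless and headful mode comes down to a single launch option, as in this sketch:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        // Set headless to false to watch the browser window while debugging.
        launchOptions: { headless: false },
    },
    requestHandler: async ({ page }) => {
        console.log(await page.title());
    },
});

await crawler.run(['https://books.toscrape.com/']);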

Using Crawlee with Proxies


Proxies are a critical tool for various applications and needs today. A proxy routes your requests through an intermediary server. In short, it hides your IP address and other identifying information so that the target website cannot tell who is making the requests.

Crawlee provides built-in support for managing proxies. That means you can supply a list of proxies for it to use and avoid many of the restrictions that typically occur without them. That includes IP-based restrictions, such as location blocks, as well as outright website blocking.

To do this, construct a new instance of the crawler you just designed and give it a ProxyConfiguration object, supplying a list of the proxies you want it to use. You can also rely on its rotation behavior, which changes the proxy in use over time; it can rotate as often as you like, even with every request.

We also encourage you to use Rayobyte’s proxy service to help you unblock restrictions that are limiting the overall effectiveness of your current campaign. More on that in a moment.

Let’s say you want to integrate proxies with Crawlee. It is not a hard process to follow, and it starts with the ProxyConfiguration class. You create an instance of this class and then provide the required options. If you are unsure what proxy configuration options you have, Crawlee lists them on its documentation site.
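Putting it together, here is a minimal sketch of a proxy-enabled crawl; the proxy URLs are placeholders you would replace with your own.

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    // Placeholder endpoints; Crawlee rotates through this list automatically.
    proxyUrls: [
        'http://proxy-1.example.com:8000',
        'http://proxy-2.example.com:8000',
    ],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    requestHandler: async ({ page, proxyInfo }) => {
        // proxyInfo reports which proxy served the current request.
        console.log(await page.title(), 'via', proxyInfo?.url);
    },
});

await crawler.run(['https://books.toscrape.com/']);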

Why Use Rayobyte to Unblock Websites


To maximize your reach with web scraping with Crawlee, it is beneficial to use a web unblocker. This type of tool, like the one available to you from Rayobyte, is a simple way to get through the clutter and access valuable data even when you are blocked from every direction.

With Rayobyte’s tool, you can mimic real traffic easily, and no API is required. A plain proxy selected through Crawlee helps, but on its own it can still limit your reach: automated traffic is easily blocked when anti-bot software is in place, which is common on many modern, sophisticated websites. With Web Unblocker, your requests look like real, authentic traffic to the website, meaning this method allows you to get around blocks that detect automation. With human-like browsing, you capture the accurate information you need while avoiding the blocks in place.

We make this entire process easy to manage, and it only takes a few minutes to add this configuration. You create a new instance of the ProxyConfiguration class and paste in a list of Rayobyte’s Web Unblocker endpoints. Once you do that, Crawlee will use that proxy server to handle your project.

The crawler, such as a Playwright crawler instance, is then initialized with that ProxyConfiguration object. Your proxy list will contain the unblocker proxies that are available for you to use.

One note here: when you set this up, you have to add your Rayobyte username and password. These are the same credentials you use to log in to your proxy service with us.
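As a sketch of what that configuration can look like, assuming a hypothetical endpoint host and placeholder credentials (take the real host, port, and login details from your Rayobyte dashboard):

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    // Hypothetical endpoint format; substitute the host, port, and
    // credentials shown in your Rayobyte account.
    proxyUrls: ['http://USERNAME:PASSWORD@unblocker.example-endpoint.com:8000'],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    requestHandler: async ({ page }) => {
        console.log(await page.title());
    },
});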


Now Is the Best Time to Make a Switch


When it comes down to it, web scraping and browser automation with Crawlee is a solid method for gathering valuable information that fits your needs. With Crawlee being so easy to use, this Node.js web crawler is applicable to many of the projects you have out there.

To get the most out of the tool, make sure you use Crawlee along with Rayobyte proxies. This can help streamline the process and provide you with the best long-term outcome. You can adjust the process to fit just about any need you have.

Check out the options and tools that we recommend at Rayobyte, and start capturing the highly beneficial data that is so critical to your business decision-making. Contact us for more information.

