Web Scraping With Scrapy – A Complete Tutorial
Web scraping is a critical tool that helps companies and organizations gather valuable data from numerous websites quickly and consistently. When you consider the sheer expanse of information available on the internet, it's clear that there is real value in harvesting and using it. Web scraping helps you do that in an efficient, and often automated, way.
Scrapy is one tool that handles this process. In this Scrapy tutorial, you will learn what Scrapy Python is and how you can use it, with relatively little effort, to extract data that is usable for various applications.
Getting anything you need from the internet is easier than ever. You can use an API or a web scraping process. If you want the best web scraping performance possible and you need efficiency, using Scrapy makes sense.
What Is Scrapy?
Scrapy is a commonly used web scraping tool. It is an open-source, Python-based framework for web crawling that allows users to extract valuable insight from the unstructured data found online in various forms.
Scrapy was designed specifically for large-scale projects. Because it is a flexible processing framework, it can gather data from websites and then store it in the desired format to be used later. Because of the complexity of web scraping, the work can bog down quickly if you have to write code for every component and task. With Scrapy, much of that is done for you, and you do not need a significant amount of coding or other specialist skills to set up and start using it.
Scrapy has many features and tools. At its core, you create a spider, run it, and then save the data that is being scraped easily and quickly. That may sound confusing, but the process does not have to be (and once you create one project, you are likely to want to reuse it for more refined tasks involving numerous data strategies).
What This Scrapy Tutorial Will Offer
Utilizing this Scrapy tutorial, you will work through several key steps:
#1: Install and create a project with Scrapy.
#2: Extract data and product information.
#3: Handle Pagination.
#4: Run Scrapy from a Python Script.
#5: Utilize Proxies to Enhance Outcomes.
How to Set Up Scrapy
It does not have to be complex to download and use Scrapy. It was built, in fact, so that anyone, even those without a lot of programming language experience, can utilize it. Let’s talk about how to set up a project for it. If you have not done so yet, it may be a good time to freshen up your Python skills (a Python tutorial can help you do that if you have not used it in some time).
First, Install Scrapy
To get started, you have to install Scrapy. You can download it at Scrapy.org/download. There is also a comprehensive Installation Guide for Scrapy available from the organization if you find yourself struggling with any component of the process.
To install Scrapy, open your Python command terminal. Then, type in the pip command to help you:
pip install scrapy
Not too challenging, right? It is likely to take a few minutes for the installation process to complete. That could be the hardest part of the entire process!
Once it alerts you that the installation of Scrapy is complete, you can then create a Scrapy project. To do that, run this command:
scrapy startproject <project_name>
In this command, you will need to replace "project_name" with the name you want to give your project. Here is what it would look like if you decided to name your project "datascraping":
scrapy startproject datascraping
Once you do this, the tool will create a new folder with the name you have selected inside the current directory. Any files you create for your project will go within this project folder. It's a very organized structure that is simple to use.
The next step in this Python Scrapy tutorial is creating a spider, which will do the work of gathering information for you. To generate a spider, you first need to change the current directory to the new project folder. In this example, that folder is "datascraping", so use the cd datascraping command.
Once you do that, you can then create a new spider. To create the spider, incorporate the following command:
scrapy genspider <spider_name> <url_domain>
For those with any experience using similar tools, it is already clear how easy this can be to navigate, and this Scrapy tutorial will continue to guide you through the process.
As in the previous example, you will need to replace the "spider_name" and "url_domain" placeholders with the actual names you want to use.
Let’s say we want to create a spider called photos, and we want the spider to target a specific URL called myscrape.com. The command would look like:
scrapy genspider photos myscrape.com
This is all it takes to set up the actual Scrapy web scraper for this project. Now, when you look at the Scrapy project folder, you will see several key files and folders. Here is what you need to know about each one of these in Python Scrapy. (Note that in this Python Scrapy tutorial, we are keeping things very basic, but it can become more elaborate over time.)
Every Scrapy project will have the following files and folders established.
Spiders folder
The spiders folder is the hub of your project. It will contain the spiders that extract data from the various websites you list. Every spider has a specific target website that it heads off to, or a set of websites to target.
Most of the time, your spider will have a set of rules to follow. These rules tell the spider how to visit and navigate the website you point it to, along with the data you want it to extract. It's a deliberately simple design.
items.py
The next element contained within your folder will be items.py. This file holds the item classes that represent the data you want the spider to pull from the website. The items are defined using Python classes, which helps organize the scraped data in a structured format for use.
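As an illustration only, here is a minimal sketch of what items.py might contain for the hypothetical photos project; the class and field names are assumptions made for this example, not anything Scrapy requires:
import scrapy

class PhotoItem(scrapy.Item):
    # Each Field() declares one piece of data the spider is expected to collect.
    title = scrapy.Field()
    image_url = scrapy.Field()
    page_url = scrapy.Field()
Defining items this way keeps every spider in the project producing records with the same predictable structure.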
middlewares.py
The next element within the folder is middlewares.py. This file holds the middleware components that sit between Scrapy's engine and the requests it sends, which makes it the place to customize how requests are routed. If you would like to do so (and with advanced web scraping with Python using Scrapy & Splash, you may need to), you can write a custom proxy middleware within this file.
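As a rough sketch of that idea, a custom downloader middleware could set the proxy meta key that Scrapy's built-in HttpProxyMiddleware reads; the proxy address below is a placeholder, and the class still has to be enabled under DOWNLOADER_MIDDLEWARES in settings.py:
class CustomProxyMiddleware:
    # Placeholder endpoint; swap in your own proxy provider's address.
    PROXY_URL = "http://user:password@proxy.example.com:8000"

    def process_request(self, request, spider):
        # Scrapy will route this request's download through the given proxy.
        request.meta["proxy"] = self.PROXY_URL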
pipelines.py
Next comes pipelines.py. This file is used after the spider visits the website and starts extracting data: the scraped items pass through here to be processed and stored until you need to use them, in a structured format (so the data is usable to you).
Note that pipelines put in place a set of processing steps for the data. That is, they define what should happen to each item as it is extracted, such as cleaning, validation, or storage.
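For example, a small pipeline sketch like the following (assuming the items carry a price field, as in the earlier example) could clean each item before it is stored; it would be switched on through the ITEM_PIPELINES setting:
class PriceCleanupPipeline:
    def process_item(self, item, spider):
        # Strip currency symbols and whitespace so the price is stored consistently.
        if item.get("price"):
            item["price"] = item["price"].replace("$", "").strip()
        return item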
settings.py
You will also notice this component in your project. The settings.py file controls the behavior of the Scrapy framework. You can use it for a variety of tools and behavior controls. That may include, for example, limiting the maximum number of concurrent requests or setting a user-agent string.
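Here is a short excerpt of what those adjustments might look like in settings.py; the values shown are arbitrary examples for illustration, not recommendations:
# settings.py (excerpt)
BOT_NAME = "datascraping"
ROBOTSTXT_OBEY = True              # respect robots.txt rules
CONCURRENT_REQUESTS = 8            # cap how many requests run in parallel
DOWNLOAD_DELAY = 1                 # pause (in seconds) between requests
USER_AGENT = "Mozilla/5.0 (compatible; datascraping-bot)"   # identify your crawler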
scrapy.cfg
There will also be a scrapy.cfg file present. This is a plain text file that includes a series of configuration directives (which you set as desired). It contains, for example, the project name and points to the settings that should be used when running the spiders.
Most of these details are not complex, and as noted, you can find much of the information you need in the official Scrapy documentation that accompanies the download.
How to Customize a Scrapy Spider
Now that you have a solid foundation from this Scrapy tutorial, you can move on to the more important elements, including customizing the Scrapy project.
Let’s refresh for a moment. In our example, we were looking for “photos” and we had a target URL of myscrape.com.
Our goal in this project is to tell Scrapy to scrape all of the photo information from that website. That sounds pretty straightforward, and it can be if you learn how to apply the strategies here.
You will need to look at the page's HTML to see what the spider will actually be working with.
For example, the URL for the page is available within the href attribute of an <a> tag. Also, note that this <a> is enclosed by a <li> tag with the "next" class.
To scrape data that is on that URL related to photos, we need to have a spider that will be able to log into the system (if needed) and then find the appropriate page to scrape.
Here’s How to Handle the Next Steps
Create a spider
Let's create a spider for products at myscrape.com. To do that, we would use the command:
scrapy genspider products myscrape.com
This will generate a spider, but it will not contain any specifics yet. You have to tell it what steps to take to achieve the desired result. If you look at the generated file, you will see the following attributes (a sketch of the whole file follows this list):
- name – the identifier that Scrapy commands (such as scrapy crawl) use to refer to this spider.
- allowed_domains – the domains the spider is allowed to crawl (a safety feature that restricts the spider to just those domains).
- start_urls – the starting points for the crawl; parse() is the first callback executed for the response to each of these URLs.
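For reference, the skeleton that scrapy genspider products myscrape.com generates looks roughly like this (the exact boilerplate varies slightly between Scrapy versions):
import scrapy

class ProductsSpider(scrapy.Spider):
    name = "products"                      # used by "scrapy crawl products"
    allowed_domains = ["myscrape.com"]     # restricts crawling to this domain
    start_urls = ["https://myscrape.com"]  # first pages requested

    def parse(self, response):
        # First callback: Scrapy calls this with the response for each start URL.
        pass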
Adding the crawling logic
In this particular example, the start_urls will be the starting point of the scraping logic. For each of them, the scraper calls the parse() function, which should find all of the product links and schedule them for scraping. We need to update it with crawling logic, such as finding the product URLs we are interested in and following the pagination links.
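A hedged sketch of what that crawling logic could look like inside the spider; the CSS selectors (a.product-link and li.next a) are assumptions about the page markup described above, and the parse_product callback is filled in under the next heading:
    def parse(self, response):
        # Schedule every product page found on this listing page for scraping.
        for href in response.css("a.product-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_product)

        # Follow the pagination link inside the <li class="next"> element, if any.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)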
Adding parsing logic
The next step in the process is to take that basic crawling logic just completed and then create the parsing logic from it. To do this, you will need to tell it what to extract from the page. If you are looking at a product’s description, for example, and you want to pull the title and price, you need to tell the spider to capture those details. This is where you will customize it to fit your specific application needs.
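Continuing the sketch, a parse_product callback might pull the title and price; again, the selectors are placeholders you would adapt to the real page:
    def parse_product(self, response):
        # Yielding a dict (or an Item) hands the scraped record to the pipelines.
        yield {
            "title": response.css("h1.product-title::text").get(),
            "price": response.css("span.price::text").get(),
            "url": response.url,
        }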
Basic Settings in Scrapy Tutorial to Consider
One of the reasons many people like to use Scrapy is that it does not require a lot of complex settings. Instead, it relies on sensible built-in defaults. Of course, it is not always desirable or beneficial to use these defaults, so it may be necessary to adjust the settings so they reflect exactly what your project needs.
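One common way to adjust those defaults for a single spider is the custom_settings attribute; the values here are purely illustrative:
import scrapy

class ProductsSpider(scrapy.Spider):
    name = "products"
    # Overrides applied only while this spider runs; everything else keeps the defaults.
    custom_settings = {
        "DOWNLOAD_DELAY": 2,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
    }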
Once you go through and make adjustments to the basic settings that align with your objective within this project, the next step is to execute (it really can be that streamlined to complete this project!)
How to run spiders
Now that you have the foundation in place, you can start to execute the spiders, that is, send them off to capture the data that you need. There are two ways to do this.
The first is through the scrapy command-line tool, as you have been doing up to this point. The second method is to call Scrapy explicitly from a Python script, which is useful when the crawl needs to be embedded in a larger program.
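A minimal sketch of the script route, using Scrapy's CrawlerProcess API and run from inside the project directory; the import path for the spider assumes the project layout used in this tutorial:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from datascraping.spiders.products import ProductsSpider  # assumed module path

process = CrawlerProcess(get_project_settings())
process.crawl(ProductsSpider)
process.start()   # blocks until the crawl is finished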
Once you make the decision and launch the process, just watch the results. What you will find about Scrapy are a few key things:
- The logs are very easy to navigate and read
- Scrapy logs everything the engine is performing
- It logs the returned results as well
In many situations, it will also provide you with some scrape statistics. For example, it may tell you how many items or products were scraped. It also will include information about the length of time it took to start and finish the project.
Save your Scrapy results
Most of the time, you will have more than one spider, and once you see useful data, you will want more of it. However, the data can only be used if you save it.
With the previous steps, you created a spider that scrapes product data and then logs the results. You will also need to save that data to a file. To do that, update the "scrapy crawl" command with an output option.
It is also possible to configure the FEEDS setting. In doing this, the scraped data is automatically stored in a file (or files) you have previously selected. One of the nice benefits of this is that it allows you to configure multiple output storages for the same data, and you can specify each one in some detail.
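For example, with a recent Scrapy version the crawl command can write every scraped item straight to a file (the -O flag overwrites the file on each run; -o appends instead):
scrapy crawl products -O products.json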
You will find that Scrapy supports various feed exporters, including S3 from Amazon and Google Cloud Storage.
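As a sketch, a FEEDS entry in settings.py might look like this; the file paths are placeholders, and remote URIs such as an s3:// bucket can be used the same way:
# settings.py (excerpt)
FEEDS = {
    "output/products.json": {"format": "json"},
    "output/products.csv": {"format": "csv"},
}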
What Are Scrapy Limitations?
In all situations, you will find some limitations to the functionality and span of a web scraping tool. While Scrapy is an excellent choice overall for static websites with a lot of data you want to pull, there are some aspects that can limit its overall success.
Key Details About Scrapy and Optimizing Its Functional Use
It is important to note a few key things about Scrapy as you work towards incorporating web scraping. First, Scrapy is used to create a web scraper, as noted, and that process is very easy. However, some challenges can arise when putting the actual spider to work in practice. For example, there may be instances in which an IP ban occurs, or there are concerns with CAPTCHAs. These would impede the process and progress of the spider. Managing sessions and cookies can also be a problem.
A core concern is dynamic websites. Scrapy is best suited for static websites, where it can pull huge amounts of data rather easily. Dynamic websites, however, cause Scrapy to stumble a bit. That's because these websites present different content depending on variables such as the visitor's age, gender, or location, and in some situations, the system configuration.
This is important because, when two different users visit the page, they will input different information, and as a result, they will receive different information. Dynamic websites like this aim to provide very personalized content for the user. That is not a bad thing, but it can make Scrapy harder to use.
That does not have to be the case, though. Scrapy is not necessarily made for dynamic websites, but it can scrape dynamic web content. To have this occur, you must schedule Scrapy to run on a regular basis. This enables you to scrape dynamic content. Then, save the results and compare them to monitor changes to the pages over a period of time.
How can you scrape dynamic web pages with Scrapy then? There are a variety of routes to take. For example, you could use a browser automation tool. One of the most commonly selected is Selenium, though there are many others to also take into consideration. This allows you to navigate the more complex components of these websites.
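As a very rough sketch of that approach, assuming Selenium and a Chrome driver are installed, a script can let the browser render the JavaScript and then hand the resulting HTML to Scrapy's selectors; the URL and CSS selector are placeholders:
from selenium import webdriver
from scrapy.selector import Selector

driver = webdriver.Chrome()                  # needs Chrome plus a compatible driver
driver.get("https://myscrape.com/photos")    # the browser executes the page's JavaScript

# Parse the fully rendered HTML with Scrapy's selector machinery.
selector = Selector(text=driver.page_source)
titles = selector.css("h2.photo-title::text").getall()
driver.quit()

print(titles)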
Ready to Dive Into Scrapy Software?
Web scraping is a fascinating process. It is filled with opportunities to create exceptional information and insights for some of the most important and complex decisions companies have to make. Data is more within reach than ever.
There are certainly a lot of reasons to use Scrapy software. It is one of the most versatile components out there for web scraping. While this Scrapy Tutorial gives you the foundation from which to build, also note that there are a lot more features and methods that can be applied to using this technology.
Python Scrapy tutorial details are available to help you further improve your functionality and the overall breadth of data. For example, you can extend Scrapy. Because it is such a configurable framework, it gives you a wide range of ways to modify its functionality to fit your specific goals. This is often done through extensions, middlewares, pipelines, and general extension slots. It is worth learning more about each of these areas in a Scrapy tutorial Python because it will provide you with more insight and support on how to create these types of customizations.
The key limitations of Scrapy relate to dynamic websites, but there are strategies to work around this. For example, the use of proxies can offer support in some applications. We encourage you to explore how this may apply in your specific situation, including how proxies and Python web scraping work together to help you get the information you need.
Web scraping for beginners, like this Scrapy tutorial focuses on, can be amplified with the help of proxies. Our team at Rayobyte is happy to help you with a wide range of tools to meet virtually every goal you have. Take a closer look at the resources we offer and how you can get your project started today.
At Rayobyte, we offer the products you need to ensure that your web scraping data is up-to-date and as accessible as possible.