Asynchronous Web Scraping with Python
One of the benefits of web scraping is the power to capture huge amounts of data effectively and efficiently. Yet, the process can demand significant resources and may make it challenging for you to keep the project moving forward at the speed you need.
The way around this is asynchronous web scraping with Python. By utilizing Python’s asynchronous capabilities, you can dramatically speed up data extraction. To do this, you’ll use Python libraries that support the process, including asyncio and aiohttp.
What does this do, though? Why is it better? We’ll dive into the details and build a strong case for building your own asynchronous web scraper in Python to improve your results.
What Is Asynchronous Web Scraping?
Web scraping is the process of capturing data from various websites to achieve specific goals. Asynchronous web scraping, often called non-blocking or concurrent web scraping, lets a program start a lengthy operation and still respond to other events while it waits. In short, you can do more than one thing at a time instead of waiting for each task to complete before starting the next.
By using the available libraries, including asyncio, you can write the necessary non-blocking code and, with other tools, handle your HTTP requests far more efficiently.
With so many potential benefits and functional options, asynchronous web scraping with Python has become a very popular tool for developers. It lets you run more than one scraping task at the same time, making the most out of your time.
One of the most important reasons to use this approach is that, in many situations, web scraping programs spend most of their time simply waiting for a response. With every request a web scraper sends, it must wait for the web page to respond. Each individual wait may not be very long, but when you repeat this process over and over to request significant amounts of data, it slows down the whole operation.
In traditional synchronous web scraping, each task is completed one after the other. Let’s say you are scraping data from just three websites. The first task takes about 30 seconds, the second takes another 15 seconds, and the third requires 25 seconds. One task finishes, then the next starts, so the whole process takes about 70 seconds.
With asynchronous scraping, the tasks run concurrently. Instead of waiting for one to finish before starting the next, all three run at the same time, so the longest task determines the length of the process: about 30 seconds.
When you expand this over large amounts of data and numerous websites, you can see how that time savings – more than half in this case – really adds up.
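To make that difference concrete, here is a minimal, self-contained sketch that simulates the three tasks above with asyncio.sleep() instead of real network requests. The 30-, 15-, and 25-second delays are simply the made-up figures from the example, so feel free to shrink them if you want the script to finish faster.

import asyncio
import time

async def simulated_scrape(name, seconds):
    # stand-in for a network request that takes `seconds` to respond
    await asyncio.sleep(seconds)
    return f"{name} done"

async def main():
    start = time.time()
    # run all three "scrapes" concurrently; total time is roughly the longest task
    results = await asyncio.gather(
        simulated_scrape("site 1", 30),
        simulated_scrape("site 2", 15),
        simulated_scrape("site 3", 25),
    )
    print(results)
    print(f"Time taken: {time.time() - start:.1f} seconds")  # ~30s, not 70s

asyncio.run(main())

Running this prints all three results after roughly 30 seconds, because the waits overlap instead of stacking.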
How to Create an Asynchronous Web Scraper in Python
With such benefits possible, it makes sense that you would want to use these Python libraries and tools to set up asynchronous web scraping. Unlike traditional synchronous web scraping, which typically relies on the requests library, asynchronous web scraping requires an event loop: a mechanism that allows tasks to run independently and concurrently.
There are various tools to help with this. It starts with the event loop provided by asyncio, and AIOHTTP then allows you to make asynchronous HTTP requests within the loop you just created.
To do this, you need to install the necessary libraries. Here are the steps you can take from your terminal to do so.
Install AIOHTTP. First, install the AIOHTTP library with this command:
pip install aiohttp
Then, import the required dependencies: aiohttp, along with asyncio and time. You will also define the asynchronous main function. To do that, add the following to scraper.py:
import aiohttp
import asyncio
import time

async def main():
    # ...
One important note: both asyncio and time are part of Python’s standard library, so you will not have to install them separately.
Now, consider this process using a made-up URL. Within the main function, we need to define the target URL. Let’s say that is https://www.scrapingclasses.com/ecommerce/. You will also record the time at the start and at the end to track the overall performance of the scrape.
Once you take these steps, you then need to include your scraping logic. For this example, the scraping logic lives in a fetch function, which is defined in the next step. After fetching, you record the current time again and calculate how long the request took.
Try out this code to get a feel for what to expect:
async def main():
    url = "https://www.scrapingclasses.com/ecommerce/"

    # time tracking: start time
    start_time = time.time()

    # fetch content asynchronously
    content = await fetch(url)

    # print content
    print(content)

    # time tracking: end time
    end_time = time.time()

    # calculate and print the time taken
    print(f"Time taken: {end_time - start_time} seconds")
From here, we need to define the fetch function. This is where you create the asynchronous component. Within the function, open an asynchronous HTTP session using aiohttp.ClientSession. Then, use the session to make a GET request to the target URL. Finally, retrieve and return the text content.
Consider running this code next to see how it will work for you:
async def fetch(url):
    # create an HTTP session
    async with aiohttp.ClientSession() as session:
        # make a GET request using the session
        async with session.get(url) as response:
            # return the text content
            return await response.text()
Now, let’s bring everything together into a single script. The main() function is executed with asyncio.run(main()).
If you followed along and kept the same values, you will have the following complete code to work with:
import aiohttp
import asyncio
import time

async def main():
    url = "https://www.scrapingclasses.com/ecommerce/"

    # time tracking: start time
    start_time = time.time()

    # fetch content asynchronously
    content = await fetch(url)

    # print content
    print(content)

    # time tracking: end time
    end_time = time.time()

    # calculate and print the time taken
    print(f"Time taken: {end_time - start_time} seconds")

async def fetch(url):
    # create an HTTP session
    async with aiohttp.ClientSession() as session:
        # make a GET request using the session
        async with session.get(url) as response:
            # return the text content
            return await response.text()

# run the main function
asyncio.run(main())
Now that you have built this bit of code, you can test it out. Go ahead and run it. When you do, you should see the page’s content printed, followed by the time the request took.
This process makes just one asynchronous request (you can see how repeating it page by page would become tedious over time). However, Python async web scraping can go much further and handle more than one task at once.
Scraping Multiple Pages Asynchronously
What if you want to use Async web scraping in Python to help you scrape multiple pages at one time? You can do that. This would be the next step in getting more accomplished in less time.
In order to scrape more than one page, you will need to build separate tasks for each of the URLs you want to scrape. Once you do that, you can then group them using asyncio.gather().
When you do this, each task represents an async operation that retrieves data from a specific page. Once you group them, they are executed concurrently.
This is an extension of the task you already learned. In the previous example, you wrote code to fetch a single page. Now, we want to retrieve data from multiple pages.
To do this, start by defining a function that takes a session and a URL as parameters and retrieves data from that page. Within this function, make a GET request using the AIOHTTP session and then return the HTML content.
Here is an example of what you can create using this method:
async def fetch_page(session, url):
    # make a GET request using the session
    async with session.get(url) as response:
        # return the HTML content
        return await response.text()
Once this is created, you then need to provide the specific details within the main() function: initialize a list of the URLs you want to target, record the time at the start, and create an AIOHTTP session before initializing the task list. Confused? Here’s what your code would look like – just update the URLs to match the sites you need to visit.
async def main():
    # initialize a list of URLs
    urls = [
        "https://www.scrapingclasses.com/ecommerce/",
        "https://www.scrapingclasses.com/ecommerce/page/2/",
        "https://www.scrapingclasses.com/ecommerce/page/3/"
    ]

    # time tracking: start time
    start_time = time.time()

    # create an AIOHTTP session
    async with aiohttp.ClientSession() as session:
        # initialize the tasks list
        tasks = []
Now that you’ve gotten this far, it’s only a few more steps to using Python async web scraping to capture the information you need. Your next step is to loop through the URLs, create a separate task for each one, and append it to the tasks list. Finally, group the tasks with asyncio.gather() to execute them concurrently.
Here is what the code should look like when you complete this step in the process:
# ...
async with aiohttp.ClientSession() as session:
    # ...

    # loop through the URLs and append tasks
    for url in urls:
        tasks.append(fetch_page(session, url))

    # group and execute the tasks concurrently
    htmls = await asyncio.gather(*tasks)
Now that you have that complete, all that remains is to record the end time and process the HTML responses. Putting everything together, your code should look like this:
import aiohttp
import asyncio
import time

async def fetch_page(session, url):
    # make a GET request using the session
    async with session.get(url) as response:
        # return the HTML content
        return await response.text()

async def main():
    # initialize a list of URLs
    urls = [
        "https://www.scrapingclasses.com/ecommerce",
        "https://www.scrapingclasses.com/ecommerce/page/2/",
        "https://www.scrapingclasses.com/ecommerce/page/3/"
    ]

    # time tracking: start time
    start_time = time.time()

    # create an AIOHTTP session
    async with aiohttp.ClientSession() as session:
        # initialize the tasks list
        tasks = []

        # loop through the URLs and append tasks
        for url in urls:
            tasks.append(fetch_page(session, url))

        # group and execute the tasks concurrently
        htmls = await asyncio.gather(*tasks)

    # time tracking: end time
    end_time = time.time()

    # print or process the fetched HTML content
    for url, html in zip(urls, htmls):
        print(f"Content from {url}:\n{html}\n")

    # calculate and print the time taken
    print(f"Time taken: {end_time - start_time} seconds")

# run the main function
asyncio.run(main())
Why Use Asynchronous Web Scraping with Python?
There are numerous ways to capture data, and with Python async web scraping, you can get a lot of work done without taking too much time. For example, when combined with frameworks like Scrapy, which supports asyncio, you can manage large-scale scraping tasks with minimal latency. This can be critical in some situations, including when you have a large dataset to navigate, are working through dynamic websites, or face other scenarios that require high performance. Asynchronous scraping provides you with both speed and scalability in your web scraping operations.
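When scraping at scale, it is usually wise to cap how many requests run at once so you don’t overwhelm the target site or your own connection. Here is a minimal sketch of one way to do that with asyncio.Semaphore. The limit of 5, the fetch_page_limited helper, and the generated page URLs are all illustrative assumptions, not part of the earlier examples.

import asyncio
import aiohttp

async def fetch_page_limited(session, semaphore, url):
    # wait for a free slot before making the request
    async with semaphore:
        async with session.get(url) as response:
            return await response.text()

async def main():
    # hypothetical list of many pages to scrape
    urls = [f"https://www.scrapingclasses.com/ecommerce/page/{i}/" for i in range(1, 21)]

    # cap the number of requests running at the same time (5 is an arbitrary example value)
    semaphore = asyncio.Semaphore(5)

    async with aiohttp.ClientSession() as session:
        tasks = [fetch_page_limited(session, semaphore, url) for url in urls]
        htmls = await asyncio.gather(*tasks)

    print(f"Fetched {len(htmls)} pages")

asyncio.run(main())

Capping concurrency like this keeps throughput high while staying gentler on the target server.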
Let Rayobyte Improve the Process
You already know the value of building an asynchronous web scraper in Python, but with Rayobyte, you can take the process to the next level by adding more protection. Utilize our proxies to help you safeguard your scraping activities and minimize the risk of being shut down. Contact Rayobyte now to learn more about how we can help you improve asynchronous web scraping with Python.