Welcome to Rayobyte University’s guide on Request Handling in Scrapy! Effective request handling in Scrapy empowers you to simulate real user interactions, manage cookies, customize headers, and handle various server responses—ensuring smoother and more reliable data extraction.
Handling requests properly allows your Scrapy spiders to behave more naturally and interact with websites in a way that avoids detection or blocks. By customizing requests, you can set realistic browser headers, manage cookies and sessions, chain requests with callback functions, and handle non-200 status codes gracefully.
Let’s explore these capabilities in detail, along with code examples.
Websites often inspect the User-Agent, Accept-Language, and Referer headers to determine if requests are from real users or bots. Customizing these headers helps your requests appear legitimate, making it harder for websites to block your scraper.
Example: Setting a custom User-Agent and other headers
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ['http://example.com']

    def start_requests(self):
        # Custom headers to mimic a real browser
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36',
            'Accept-Language': 'en-US,en;q=0.9',
            'Referer': 'http://google.com'  # Simulating access via Google search
        }
        for url in self.start_urls:
            yield scrapy.Request(url=url, headers=headers, callback=self.parse)

    def parse(self, response):
        # Parsing logic for the main page
        yield {'title': response.css('title::text').get()}
Explanation:
The User-Agent header identifies the browser and operating system, helping the server treat your request as if it came from a typical browser. Accept-Language sets the preferred language for content, while Referer can indicate a referral source like Google, adding credibility to the request.
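If every request should carry the same headers, you can also set them once at the spider level instead of building the dictionary in start_requests. Here is a minimal sketch using Scrapy's custom_settings attribute with the USER_AGENT and DEFAULT_REQUEST_HEADERS settings; the values simply mirror the example above:

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ['http://example.com']

    # Spider-wide settings: every request gets these headers by default
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36',
        'DEFAULT_REQUEST_HEADERS': {
            'Accept-Language': 'en-US,en;q=0.9',
            'Referer': 'http://google.com',
        },
    }

Headers passed explicitly to scrapy.Request still take precedence, so you can keep the defaults broad and override them only where a specific request needs something different.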
Cookies store session data and are essential for tasks like logging into websites. Scrapy handles cookies automatically, but you can also send custom cookies when a session or authentication is required.
Example: Sending custom cookies to maintain a session
def start_requests(self):
    # Custom cookies for session management
    cookies = {
        'sessionid': 'abc123',
        'preferences': 'user_settings'
    }
    for url in self.start_urls:
        yield scrapy.Request(url=url, cookies=cookies, callback=self.parse)

def parse(self, response):
    # Parsing logic with session-specific data
    yield {'user_data': response.css('.user-info::text').get()}
Explanation:
The custom cookies (sessionid and preferences) are sent with each request to maintain a logged-in session or user preferences.
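Hard-coding a sessionid only works if you already have one. When the session cookie comes from logging in, a common pattern is to submit the login form first and let Scrapy's cookie middleware carry the resulting cookies into later requests automatically. Here is a sketch of that pattern with scrapy.FormRequest; the login URL and form field names are hypothetical and will differ per site:

def start_requests(self):
    # Hypothetical login endpoint and field names; adjust for the real site
    yield scrapy.FormRequest(
        url='http://example.com/login',
        formdata={'username': 'my_user', 'password': 'my_pass'},
        callback=self.after_login,
    )

def after_login(self, response):
    # Cookies set by the server during login are stored automatically
    # and sent along with every subsequent request
    yield scrapy.Request('http://example.com/account', callback=self.parse)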
Callback functions let you control the flow of scraping by setting up sequential or multi-page requests. This is useful when you want to gather additional information from linked pages or process data across multiple steps.
Example: Using callback functions to follow links and extract additional data
def parse(self, response):
    # Extract basic data and follow a link to get more details
    item = {'title': response.css('h1::text').get()}
    detail_url = response.css('a.details::attr(href)').get()
    if detail_url:
        # Pass item to the next callback for additional data
        yield response.follow(detail_url, self.parse_detail, meta={'item': item})

def parse_detail(self, response):
    # Retrieve item from meta data and add more details
    item = response.meta['item']
    item['details'] = response.css('.detail-info::text').get()
    yield item
Explanation:
The parse method extracts basic data and follows a link for additional information. response.follow sets up parse_detail as the callback, allowing additional data to be captured from the linked page and combined with the initial data in item.
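As a side note, Scrapy 1.7+ also offers cb_kwargs, which passes data to the next callback as plain keyword arguments instead of going through meta. The same hand-off could be written as:

def parse(self, response):
    item = {'title': response.css('h1::text').get()}
    detail_url = response.css('a.details::attr(href)').get()
    if detail_url:
        # cb_kwargs delivers item straight into parse_detail's signature
        yield response.follow(detail_url, self.parse_detail, cb_kwargs={'item': item})

def parse_detail(self, response, item):
    # item arrives as a keyword argument instead of via response.meta
    item['details'] = response.css('.detail-info::text').get()
    yield item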
Websites sometimes return non-200 status codes (e.g., 404 Not Found, 403 Forbidden) due to server-side issues or bot restrictions. Scrapy allows you to detect these status codes and define custom handling strategies.
Example: Handling non-200 responses with logging
# Allow 404 and 403 responses to reach the callback; by default,
# Scrapy's HttpErrorMiddleware filters out non-2xx responses
handle_httpstatus_list = [403, 404]

def parse(self, response):
    if response.status == 200:
        # Process the response normally
        yield {'content': response.text}
    elif response.status == 404:
        self.logger.warning(f"Page not found: {response.url}")
    elif response.status == 403:
        self.logger.warning(f"Access forbidden: {response.url}")
    else:
        self.logger.warning(f"Unexpected status {response.status} for {response.url}")
Explanation:
If response.status is 200, the page loaded successfully and normal parsing proceeds. Other status codes are logged as warnings so you can spot missing pages or blocks. The handle_httpstatus_list attribute is required because Scrapy's HttpErrorMiddleware otherwise discards non-2xx responses before they ever reach your callback.
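For transient server errors you rarely need to retry by hand: Scrapy's built-in RetryMiddleware re-issues failed requests automatically. A minimal sketch tuning it through spider settings (these values are illustrative, not recommendations):

# Goes on the spider class, alongside name and start_urls
custom_settings = {
    'RETRY_ENABLED': True,    # on by default
    'RETRY_TIMES': 3,         # extra attempts per request
    'RETRY_HTTP_CODES': [500, 502, 503, 504, 408, 429],  # codes that trigger a retry
}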
Advanced request handling in Scrapy opens up possibilities for resilient, high-quality scraping. By customizing headers, managing cookies, using callbacks, and handling status codes, you’re equipped to tackle real-world scraping challenges. Join us in the next session to explore Advanced Scrapy Techniques and expand your skills further. Happy scraping!
Our community is here to support your growth, so why wait? Join now and let’s build together!