  • How to handle multi-page scraping with pagination in Python?

    Posted by Mildburg Beth on 12/17/2024 at 9:57 am

    Scraping data that is spread across multiple pages can be challenging. The key is to identify how the website implements its “Next Page” button or pagination links. On some sites the URL changes with each page (e.g., adding ?page=2 to the URL), while others rely on JavaScript to load more content dynamically. How do you handle these differences effectively?
    One method is to extract the pagination links from the HTML and follow them programmatically. When the page number appears in the URL, it is simpler still to increment it in a loop until a page comes back empty. Python’s requests library and BeautifulSoup are well-suited for both. Let’s look at an example of the page-number approach:

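    import requests
    from bs4 import BeautifulSoup

    base_url = "https://example.com/products"
    headers = {"User-Agent": "Mozilla/5.0"}
    page = 1

    while True:
        # Request one page at a time; the timeout keeps a stalled server from hanging the loop.
        response = requests.get(f"{base_url}?page={page}", headers=headers, timeout=10)
        if response.status_code != 200:
            print(f"Stopping at page {page}: HTTP {response.status_code}.")
            break

        soup = BeautifulSoup(response.content, "html.parser")
        products = soup.find_all("div", class_="product-item")
        if not products:
            # An empty page usually means we have run past the last page.
            print("No products found on this page.")
            break

        for product in products:
            # find() returns None when a tag is missing, so check before reading its text.
            title = product.find("h2", class_="product-title")
            price = product.find("span", class_="product-price")
            name_text = title.get_text(strip=True) if title else "N/A"
            price_text = price.get_text(strip=True) if price else "N/A"
            print(f"Name: {name_text}, Price: {price_text}")

        page += 1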
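
    The loop above steps through page numbers. The other method mentioned earlier, following the pagination links found in the HTML, might look roughly like the sketch below. The function name crawl_by_next_link, the fetch parameter, and the "a.next" selector are all assumptions for illustration; the real selector depends on the site's markup. The product selectors are the same ones used in the example above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def crawl_by_next_link(start_url, fetch, max_pages=50):
    """Collect product names by following "next" links from start_url.

    fetch is any callable that takes a URL and returns that page's HTML.
    """
    names = []
    seen = set()  # URLs already visited, to avoid looping forever
    url = start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        soup = BeautifulSoup(fetch(url), "html.parser")
        for title in soup.select("div.product-item h2.product-title"):
            names.append(title.get_text(strip=True))
        # Hypothetical selector: assumes the next-page link is <a class="next" href="...">.
        next_link = soup.select_one("a.next[href]")
        # Relative hrefs are resolved against the current page's URL.
        url = urljoin(url, next_link["href"]) if next_link else None
    return names
```

    With requests, fetch could be something like lambda u: requests.get(u, headers=headers, timeout=10).text. Passing the fetcher in as an argument also makes the crawl easy to try against saved HTML before pointing it at a live site.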
    

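    The page-number loop above works when the page number appears in the URL, but what about infinite scrolling or AJAX-based pagination? There you’d need a browser automation tool like Selenium (or Puppeteer, if you are working in Node.js) to simulate scrolling or clicking “Load More” buttons. Often, though, the page loads its data from an API behind the scenes, and inspecting the network traffic in the browser’s developer tools can reveal the endpoints so you can fetch the data directly.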
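
    If the network traffic does reveal a JSON endpoint, paging through it could be sketched as follows. Everything about the response shape here is an assumption: a hypothetical endpoint that accepts ?page=N and returns {"items": [...], "next_page": N or null}. Real APIs vary (cursors, offsets, Link headers), so the field names would need adjusting. The sketch uses only the standard library; requests.get(url).json() would do the same job.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def fetch_json(url):
    """Fetch a URL and decode its JSON body."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return json.load(response)


def iter_api_items(endpoint, fetch=fetch_json, max_pages=100):
    """Yield items page by page until the API reports no further page."""
    page = 1
    for _ in range(max_pages):  # hard cap in case next_page never ends
        payload = fetch(f"{endpoint}?{urlencode({'page': page})}")
        # "items" and "next_page" are assumed field names, not a real API.
        yield from payload.get("items", [])
        next_page = payload.get("next_page")
        if not next_page:
            break
        page = next_page
```

    Usage would be along the lines of for item in iter_api_items("https://example.com/api/products"): ..., where the URL is a placeholder for whatever endpoint the network tab shows.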
    Have you ever struggled with multi-page scraping, and do you prefer scraping HTML or using APIs?
