
  • Scrape product reviews, pricing, and categories from Currys UK with Python

    Posted by Michael Woo on 12/05/2024 at 10:51 am

    Scraping Currys UK, a leading electronics retailer, means extracting key information such as product reviews, pricing, and categories to build insights or automate workflows. In Python, the requests and BeautifulSoup libraries handle most of the work. The first step is to identify the URL structure of the pages you want to scrape. You can usually do this by visiting a few product pages and looking for patterns in the URLs. It is also worth noting at this stage whether the pages are served as static HTML or rendered dynamically.
    Next, you need to inspect the webpage source (using browser developer tools) to identify the tags and classes associated with the data you wish to extract. Reviews are often stored in a section separate from the main product description, while pricing and categories might be directly embedded within the product details section. It’s important to handle pagination if the reviews span multiple pages.
    One of the challenges with scraping reviews is ensuring that dynamically loaded content (rendered using JavaScript) is handled properly. If the required data isn’t present in the HTML response from requests, you may need to use Selenium or analyze network activity for API calls that fetch this data. For simplicity in this example, we’ll focus on scraping static HTML content.
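
    A quick way to tell whether a piece of data is in the static HTML is to parse the raw response and test for a selector. The following is a minimal sketch: the helper name present_in_static_html, the sample markup, and the selectors are all assumptions for illustration, not taken from the Currys site.

```python
from bs4 import BeautifulSoup


def present_in_static_html(html: str, css_selector: str) -> bool:
    """Return True if at least one element matching css_selector is in the raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.select_one(css_selector) is not None


# Stand-in for response.text. A JavaScript-rendered page often ships only an
# empty container that a script fills in later (hypothetical markup).
sample_html = """
<html><body>
  <h1 class="product-title">Example TV</h1>
  <div id="reviews-root"></div>
</body></html>
"""

title_found = present_in_static_html(sample_html, "h1.product-title")
reviews_found = present_in_static_html(sample_html, "p.review-text")
print("Title in static HTML:", title_found)
print("Reviews in static HTML:", reviews_found)
```

    If a selector that is visible in the browser is not found this way, the content is probably loaded by JavaScript, and Selenium or the underlying API call is the next thing to look at.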
    After fetching the HTML content with the requests library, BeautifulSoup is used to parse and navigate the document tree. This allows us to locate and extract data using tags and attributes, such as product names, prices, reviews, and associated categories. Once extracted, the data can be stored in a structured format like a CSV or database for further processing. For instance, you might want to analyze the reviews to determine customer sentiment or study pricing trends.
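    Below is the complete Python script using requests and BeautifulSoup for scraping reviews, pricing, and categories from Currys UK. The URL is a placeholder, and the tag and class names should be adjusted to match what you find when inspecting the live page: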

    import requests
    from bs4 import BeautifulSoup
    import csv
    # URL of the product page
    url = "https://www.currys.co.uk/products/your-product-url"
    # Headers to mimic a real browser
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    # Send a GET request to the page
    response = requests.get(url, headers=headers, timeout=10)
    # Check if the request was successful
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, "html.parser")
        # Scrape product name
        product_name = soup.find("h1", class_="product-title").text.strip()
        print("Product Name:", product_name)
        # Scrape price
        price = soup.find("span", class_="price").text.strip()
        print("Price:", price)
        # Scrape category
        category = soup.find("a", class_="breadcrumb-link").text.strip()
        print("Category:", category)
        # Scrape reviews
        reviews_section = soup.find("div", class_="reviews-section")
        if reviews_section:
            reviews = reviews_section.find_all("p", class_="review-text")
            for idx, review in enumerate(reviews, 1):
                print(f"Review {idx}:", review.text.strip())
        # Save to CSV
        with open("currys_data.csv", "w", newline="", encoding="utf-8") as file:
            writer = csv.writer(file)
            writer.writerow(["Product Name", "Price", "Category", "Reviews"])
            # Fall back to a placeholder when the section is missing or contains no reviews
            review_texts = [review.text.strip() for review in reviews] if reviews_section and reviews else ["No reviews"]
            writer.writerow([product_name, price, category, " | ".join(review_texts)])
    else:
        print(f"Failed to fetch the page. Status code: {response.status_code}")
    
  • 4 Replies
  • Ahmose Tetty

    Member
    12/13/2024 at 8:21 am

    The script could be improved by handling cases where an expected HTML element is missing from the page. The reviews section is already guarded with an if check, but the product name, price, and category lookups are not: soup.find returns None when nothing matches, so calling .text on the result raises an AttributeError. Adding conditional checks or try-except blocks would make the code more robust.
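
    One possible way to do this is a small helper that returns a default when the element is absent. This is only a sketch: safe_text is a hypothetical name, and the inline HTML stands in for a real page that lacks a price element.

```python
from bs4 import BeautifulSoup


def safe_text(parent, tag, class_name, default="N/A"):
    """Return the stripped text of the first matching element, or default if none exists."""
    element = parent.find(tag, class_=class_name)
    return element.get_text(strip=True) if element is not None else default


# Inline HTML standing in for a page that has a title but no price element.
html = '<h1 class="product-title"> Example Laptop </h1>'
soup = BeautifulSoup(html, "html.parser")

product_name = safe_text(soup, "h1", "product-title")
price = safe_text(soup, "span", "price")  # no such element, so the default is used
print("Product Name:", product_name)
print("Price:", price)
```

    In the original script, each soup.find(...).text.strip() call could be swapped for a call like this, so that a missing element produces a placeholder instead of a crash.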

  • Abidan Grete

    Member
    12/13/2024 at 10:04 am

    Another improvement would be to implement pagination for extracting reviews. The current implementation only scrapes the first page of reviews. Adding a loop to navigate through all pages of reviews would ensure a more comprehensive data set.
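
    A rough sketch of such a loop is below. How Currys actually paginates reviews is not covered in this thread, so the page source is passed in as a callable: fetch_page is an assumed interface that maps a page number to an HTML string, and the fake pages exist only so the loop can be run without network access.

```python
from bs4 import BeautifulSoup


def collect_reviews(fetch_page, max_pages=50):
    """Gather review texts page by page until a page yields none.

    fetch_page is any callable mapping a 1-based page number to an HTML string,
    for example a wrapper around requests.get with a hypothetical page parameter.
    """
    all_reviews = []
    for page in range(1, max_pages + 1):
        soup = BeautifulSoup(fetch_page(page), "html.parser")
        texts = [p.get_text(strip=True) for p in soup.find_all("p", class_="review-text")]
        if not texts:  # an empty page is treated as the end of the reviews
            break
        all_reviews.extend(texts)
    return all_reviews


# Fake two-page site used only to exercise the loop offline.
fake_pages = {
    1: '<p class="review-text">Great sound</p><p class="review-text">Fast delivery</p>',
    2: '<p class="review-text">Battery could be better</p>',
}
reviews = collect_reviews(lambda page: fake_pages.get(page, "<html></html>"))
print(len(reviews), reviews)
```

    The max_pages cap is a safety net against looping forever if the site keeps returning content. With real requests, a short time.sleep between pages is a sensible courtesy.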

  • Heiko Nanda

    Member
    12/14/2024 at 7:34 am

    The script could also benefit from extracting additional metadata, such as product ratings or discount details. This would provide a more complete data set for analysis, especially for businesses interested in competitive pricing.
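
    For discount details, one option is to scrape both the original and current price strings and compute the saving yourself. The sketch below covers only that arithmetic. The function names and the sample strings are assumptions, and the selectors for the "was" price or the rating would still need to be found by inspecting the page.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Convert a price string such as '£1,299.00' to a Decimal, or None if no number is found."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    return Decimal(match.group().replace(",", "")) if match else None


def discount_percent(was, now):
    """Percentage saved relative to the original price, rounded to one decimal place."""
    return round((was - now) / was * 100, 1)


# Hypothetical strings as they might appear after .text.strip()
was_price = parse_price("Was £1,299.00")
now_price = parse_price("£999.00")
saving = discount_percent(was_price, now_price)
print(f"Was {was_price}, now {now_price}, saving {saving}%")
```

    Decimal is used instead of float so that prices keep their exact pence values when compared or stored.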

  • Elio Helen

    Member
    12/17/2024 at 6:04 am

    Finally, saving the extracted data to a database instead of a CSV file would improve scalability. Using a library like sqlite3 or integrating with a cloud-based database would allow better management and querying of the scraped data.
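
    As a starting point, here is a minimal sqlite3 sketch. The two-table schema and the save_product helper are illustrative assumptions rather than a recommended design, and the sample values are made up.

```python
import sqlite3


def save_product(conn, name, price, category, reviews):
    """Insert one product and its reviews; the schema here is illustrative only."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products ("
        "id INTEGER PRIMARY KEY, name TEXT, price TEXT, category TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reviews ("
        "id INTEGER PRIMARY KEY, product_id INTEGER REFERENCES products(id), body TEXT)"
    )
    cursor = conn.execute(
        "INSERT INTO products (name, price, category) VALUES (?, ?, ?)",
        (name, price, category),
    )
    product_id = cursor.lastrowid
    # One row per review instead of a single " | "-joined string
    conn.executemany(
        "INSERT INTO reviews (product_id, body) VALUES (?, ?)",
        [(product_id, text) for text in reviews],
    )
    conn.commit()
    return product_id


conn = sqlite3.connect(":memory:")  # use a file path such as "currys_data.db" to persist
pid = save_product(conn, "Example TV", "£499.00", "TVs", ["Sharp picture", "Easy setup"])
review_count = conn.execute(
    "SELECT COUNT(*) FROM reviews WHERE product_id = ?", (pid,)
).fetchone()[0]
print(pid, review_count)
```

    Storing reviews in their own table makes it possible to query or count them per product, which the single CSV row in the original script does not allow.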
