Scrape product reviews, pricing, and categories from Currys UK with Python

Michael Woo · 2024-12-05T10:51:19+00:00

Scraping Currys UK, a leading electronics retailer, means extracting product reviews, pricing, and categories so you can analyze them or automate workflows. This example uses Python with the requests and BeautifulSoup libraries. The first step is to identify the URL structure of the pages you want to scrape. Visit a few product pages and look for patterns in the URLs, then check whether the content is served as static HTML or loaded dynamically.
Next, inspect the page source with the browser developer tools to identify the tags and classes that hold the data you want to extract. Reviews are often stored in a section separate from the main product description, while pricing and categories might be embedded directly in the product details section. If the reviews span multiple pages, you also need to handle pagination.
One of the challenges with scraping reviews is dynamically loaded content. The requests library does not execute JavaScript, so anything rendered in the browser after the initial page load will be missing from the HTML it returns. If the data you need isn’t in that response, you may need to use Selenium or inspect the browser's network activity for the API calls that fetch it. For simplicity, this example focuses on scraping static HTML content.
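A quick way to tell the two cases apart is to check whether the element you are after appears in the raw HTML at all. The sketch below is only an illustration: `has_static_element` is a hypothetical helper, and `sample_html` stands in for `response.text` from a real request, with the reviews deliberately left out as they would be if JavaScript injected them later.

```python
from bs4 import BeautifulSoup


def has_static_element(html: str, tag: str, class_name: str) -> bool:
    """Return True if the raw HTML already contains a matching element."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find(tag, class_=class_name) is not None


# Stand-in for response.text: the price is in the markup, the reviews are not.
sample_html = """
<html><body>
  <h1 class="product-title">Example Laptop</h1>
  <span class="price">£499.00</span>
  <div id="reviews-root"></div>
</body></html>
"""

price_is_static = has_static_element(sample_html, "span", "price")
reviews_are_static = has_static_element(sample_html, "p", "review-text")
print("price in static HTML:", price_is_static)
print("reviews in static HTML:", reviews_are_static)
```

If a check like this comes back false for data you can see in the browser, that data is probably loaded dynamically and the static approach will not reach it.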
After fetching the HTML content with the requests library, BeautifulSoup is used to parse and navigate the document tree. This lets us locate elements by tag and attribute and extract data such as product names, prices, reviews, and associated categories. Once extracted, the data can be stored in a structured format like a CSV file or a database for further processing. For instance, you might want to analyze the reviews to determine customer sentiment or study pricing trends.
Below is the complete Python script using requests and BeautifulSoup for scraping reviews, pricing, and categories from Currys UK:
```
import csv

import requests
from bs4 import BeautifulSoup

# URL of the product page (placeholder: replace with a real product URL)
url = "https://www.currys.co.uk/products/your-product-url"

# Headers to mimic a real browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

# Send a GET request to the page; the timeout stops the script from hanging
response = requests.get(url, headers=headers, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    soup = BeautifulSoup(response.content, "html.parser")

    # The tag and class names below are examples; confirm them in the
    # browser developer tools before running against a live page.

    # Scrape product name
    product_name = soup.find("h1", class_="product-title").text.strip()
    print("Product Name:", product_name)

    # Scrape price
    price = soup.find("span", class_="price").text.strip()
    print("Price:", price)

    # Scrape category
    category = soup.find("a", class_="breadcrumb-link").text.strip()
    print("Category:", category)

    # Scrape reviews (the list stays empty if the section is missing)
    review_texts = []
    reviews_section = soup.find("div", class_="reviews-section")
    if reviews_section:
        reviews = reviews_section.find_all("p", class_="review-text")
        review_texts = [review.text.strip() for review in reviews]
        for idx, text in enumerate(review_texts, 1):
            print(f"Review {idx}:", text)

    # Save to CSV, joining all reviews into a single cell
    with open("currys_data.csv", "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["Product Name", "Price", "Category", "Reviews"])
        writer.writerow([product_name, price, category, " | ".join(review_texts) or "No reviews"])
else:
    print(f"Failed to fetch the page. Status code: {response.status_code}")
```
4 Replies

Ahmose Tetty

Member
12/13/2024 at 8:21 am

The script could be improved by implementing error handling for cases where the desired HTML element does not exist on the page. For instance, if the price or category element is missing, soup.find() returns None and calling .text on it raises an AttributeError. Adding conditional checks or try-except blocks would make the code more robust.
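As a rough sketch of the conditional-check approach, the snippet below wraps each lookup in a small helper that falls back to a default value. `safe_text` and `parse_product` are hypothetical names introduced here for illustration, the class names are the ones used in the original script, and `sample_html` is made-up markup with most elements deliberately missing.

```python
from bs4 import BeautifulSoup


def safe_text(parent, tag, class_name, default="N/A"):
    """Return the stripped text of the first match, or a default if absent."""
    element = parent.find(tag, class_=class_name)
    return element.text.strip() if element is not None else default


def parse_product(html):
    """Extract product fields without raising when an element is missing."""
    soup = BeautifulSoup(html, "html.parser")

    reviews = []
    reviews_section = soup.find("div", class_="reviews-section")
    if reviews_section is not None:
        reviews = [
            p.text.strip()
            for p in reviews_section.find_all("p", class_="review-text")
        ]

    return {
        "name": safe_text(soup, "h1", "product-title"),
        "price": safe_text(soup, "span", "price"),
        "category": safe_text(soup, "a", "breadcrumb-link"),
        "reviews": reviews,
    }


# A page with no price, no breadcrumb, and no reviews section
sample_html = '<html><body><h1 class="product-title"> Example TV </h1></body></html>'
product = parse_product(sample_html)
print(product)
```

A default such as "N/A" keeps the CSV row complete; returning None instead would work just as well if you prefer to filter incomplete rows later.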
Abidan Grete

Member
12/13/2024 at 10:04 am

Another improvement would be to implement pagination for extracting reviews. The current implementation only scrapes the reviews present on the product page it fetches. Adding a loop that steps through every page of reviews would give a more comprehensive data set.
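One possible shape for that loop is sketched below. How Currys actually paginates reviews is not covered in this thread, so the page-fetching step is passed in as a function; `scrape_all_reviews`, `extract_reviews`, and `fake_pages` are hypothetical names, and the fake pages stand in for real responses so the loop can be tried offline.

```python
from typing import Callable, List, Optional

from bs4 import BeautifulSoup


def extract_reviews(html: str) -> List[str]:
    """Pull the review texts out of one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.text.strip() for p in soup.find_all("p", class_="review-text")]


def scrape_all_reviews(
    fetch_page: Callable[[int], Optional[str]], max_pages: int = 50
) -> List[str]:
    """Collect reviews page by page until a page is missing or empty."""
    all_reviews: List[str] = []
    for page in range(1, max_pages + 1):
        html = fetch_page(page)
        if html is None:
            break  # the fetcher signals that the page does not exist
        page_reviews = extract_reviews(html)
        if not page_reviews:
            break  # an empty page is treated as the end of the reviews
        all_reviews.extend(page_reviews)
    return all_reviews


# Offline stand-in for the real site: page number -> HTML
fake_pages = {
    1: '<p class="review-text">Great picture</p><p class="review-text">Easy setup</p>',
    2: '<p class="review-text">Remote feels cheap</p>',
    3: "<div>No more reviews</div>",
}
reviews = scrape_all_reviews(fake_pages.get)
print(reviews)
```

Against a live site, `fetch_page` could wrap `requests.get` and return `response.text` on a 200 status and `None` otherwise. The query parameter or URL pattern that selects the page would have to be read off the site itself, and a short pause between requests is a sensible courtesy. The `max_pages` cap guards against looping forever if the site keeps returning the last page.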
Heiko Nanda

Member
12/14/2024 at 7:34 am

The script could also benefit from extracting additional metadata, such as product ratings or discount details. This would provide a more complete data set for analysis, especially for businesses interested in competitive pricing.
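For discount details, one option is to scrape both the current and the previous price and compute the difference yourself. The sketch below assumes the page exposes a "was" price as text next to the current price, which may not hold for every product; `parse_price` and `discount_percent` are hypothetical helpers, and the two sample strings are made up.

```python
import re
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional


def parse_price(text: str) -> Optional[Decimal]:
    """Turn a string such as '£1,299.00' into a Decimal; None if no number is found."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return Decimal(match.group().replace(",", ""))


def discount_percent(was: Decimal, now: Decimal) -> Decimal:
    """Percentage saved relative to the previous price, rounded to one decimal place."""
    if was <= 0:
        raise ValueError("previous price must be positive")
    percent = (was - now) / was * 100
    return percent.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Made-up strings of the kind a price block might contain
was_price = parse_price("Was £200.00")
now_price = parse_price("£150.00")
discount = discount_percent(was_price, now_price)
print(f"Discount: {discount}%")
```

Decimal is used rather than float so that prices compare and round exactly, which matters once the values feed into pricing analysis.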
Elio Helen

Member
12/17/2024 at 6:04 am

Finally, saving the extracted data to a database instead of a CSV file would improve scalability. Using a library like sqlite3 or integrating with a cloud-based database would allow better management and querying of the scraped data.
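A minimal sketch with sqlite3 from the standard library might look like the following. The table layout and the `save_product` helper are assumptions made for illustration, and the in-memory database and sample values are there only so the snippet runs without creating a file.

```python
import sqlite3


def save_product(conn, name, price, category, reviews):
    """Insert one product and its reviews; return the new product's row id."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS products (
               id INTEGER PRIMARY KEY,
               name TEXT NOT NULL,
               price TEXT,
               category TEXT
           )"""
    )
    conn.execute(
        """CREATE TABLE IF NOT EXISTS reviews (
               id INTEGER PRIMARY KEY,
               product_id INTEGER NOT NULL REFERENCES products(id),
               body TEXT NOT NULL
           )"""
    )
    cursor = conn.execute(
        "INSERT INTO products (name, price, category) VALUES (?, ?, ?)",
        (name, price, category),
    )
    product_id = cursor.lastrowid
    # One row per review, instead of joining them into a single CSV cell
    conn.executemany(
        "INSERT INTO reviews (product_id, body) VALUES (?, ?)",
        [(product_id, body) for body in reviews],
    )
    conn.commit()
    return product_id


# In-memory database for demonstration; pass a file path to persist the data
conn = sqlite3.connect(":memory:")
product_id = save_product(
    conn, "Example Laptop", "£499.00", "Laptops", ["Fast", "Quiet fan"]
)
review_count = conn.execute(
    "SELECT COUNT(*) FROM reviews WHERE product_id = ?", (product_id,)
).fetchone()[0]
print(product_id, review_count)
```

Storing each review in its own row makes it easy to query or count reviews per product, which the single joined CSV cell does not allow.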
