General Web Scraping

How to scrape movie ratings and reviews from RottenTomatoes.com using Python?

Posted by Laurids Liljana on 12/17/2024 at 7:09 am

Scraping movie ratings and reviews from RottenTomatoes.com is an excellent way to analyze audience feedback, review trends, and critic scores for films. Python, along with libraries like BeautifulSoup and requests, can be used to scrape static content from the site. If the reviews or ratings are dynamically loaded, Selenium can help render the JavaScript content for scraping. RottenTomatoes structures its data in a systematic way, with dedicated sections for audience reviews, critic reviews, and movie details, making it straightforward to target specific data points.
Before starting, use the developer tools in your browser to inspect the webpage. Identify the HTML tags and classes that house the ratings and reviews. This will guide your scraping script. Here’s an example of scraping static content using BeautifulSoup:

import requests
from bs4 import BeautifulSoup

# Target URL for a specific movie
url = "https://www.rottentomatoes.com/m/example_movie"
headers = {
    "User-Agent": "Mozilla/5.0"
}

def text_of(parent, tag, class_name):
    """Return the stripped text of the first match, or None if it is absent."""
    element = parent.find(tag, class_=class_name)
    return element.get_text(strip=True) if element else None

# Fetch the page; the timeout keeps the script from hanging on a stalled connection
response = requests.get(url, headers=headers, timeout=10)
if response.status_code == 200:
    soup = BeautifulSoup(response.content, "html.parser")
    # Extract movie details. The class names here are examples: confirm them
    # in your browser's developer tools, because the site's markup changes.
    movie_title = text_of(soup, "h1", "scoreboard__title")
    critic_score = text_of(soup, "span", "scoreboard__score")
    audience_score = text_of(soup, "span", "scoreboard__percentage")
    print(f"Movie: {movie_title}")
    print(f"Critic Score: {critic_score}")
    print(f"Audience Score: {audience_score}")
    # Extract audience reviews
    reviews = soup.find_all("div", class_="audience-review")
    for review in reviews[:5]:  # Limit to first 5 reviews
        review_text = text_of(review, "p", "audience-review__text")
        print(f"Review: {review_text}")
else:
    print(f"Failed to fetch Rotten Tomatoes page (status {response.status_code}).")

This script extracts the movie title, critic score, audience score, and a few audience reviews. If the reviews are dynamically loaded, Selenium is a better alternative. Here’s an example:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Initialize Selenium WebDriver
driver = webdriver.Chrome()
try:
    # Let element lookups retry for up to 10 seconds while JavaScript renders
    driver.implicitly_wait(10)
    driver.get("https://www.rottentomatoes.com/m/example_movie")
    # Extract movie details. The class names are examples: confirm them in
    # your browser's developer tools before relying on them.
    movie_title = driver.find_element(By.CLASS_NAME, "scoreboard__title").text.strip()
    critic_score = driver.find_element(By.CLASS_NAME, "scoreboard__score").text.strip()
    audience_score = driver.find_element(By.CLASS_NAME, "scoreboard__percentage").text.strip()
    print(f"Movie: {movie_title}")
    print(f"Critic Score: {critic_score}")
    print(f"Audience Score: {audience_score}")
    # Extract audience reviews
    reviews = driver.find_elements(By.CLASS_NAME, "audience-review")
    for review in reviews[:5]:  # Limit to first 5 reviews
        review_text = review.find_element(By.CLASS_NAME, "audience-review__text").text.strip()
        print(f"Review: {review_text}")
finally:
    # Close the browser even if a lookup above raises an exception
    driver.quit()

The requests example sends a User-Agent header so the request looks like it comes from a browser; without one, the site is more likely to reject it or flag it as a bot. Selenium drives a real browser, so it sends browser headers on its own. If you want to scrape reviews for multiple movies, you can create a loop that navigates through movie URLs or categories.
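A loop over several movie pages could look roughly like the sketch below. The names `fetch_html` and `scrape_many` are made up for illustration, and `parse` stands for whatever function you write to pull fields out of the HTML. The pause between requests and the per-URL error handling are my own assumptions about what a multi-page run needs.

```python
import random
import time


def fetch_html(url):
    """Download one page with requests (imported here so the loop below
    can also be exercised with a stand-in fetcher)."""
    import requests

    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()
    return response.text


def scrape_many(urls, parse, fetch=fetch_html, delay_range=(2.0, 5.0)):
    """Fetch and parse each URL in turn, pausing between requests.

    `parse` is any function that turns HTML into a dict of fields.
    Failures are recorded per URL instead of stopping the whole run.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(random.uniform(*delay_range))  # pause between requests
        try:
            record = parse(fetch(url))
            record["url"] = url
            results.append(record)
        except Exception as error:  # network error, bad status, parse failure
            results.append({"url": url, "error": str(error)})
    return results
```

You would call it as `scrape_many(list_of_movie_urls, parse=my_parse_function)`, then filter out the entries that carry an `"error"` key.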
For long-term scraping projects, storing the data in a structured format, such as a CSV file or database, is recommended. Libraries like pandas make it easy to write data to a CSV file, while databases like SQLite or PostgreSQL allow for efficient querying and analysis.
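As a rough sketch of the CSV option, the snippet below uses Python's built-in `csv` module instead of pandas (pandas' `DataFrame.to_csv` would do the same job). The column names in `FIELDS` and the function name `write_reviews_csv` are assumptions for illustration, not something the site defines.

```python
import csv

# Assumed column layout for one scraped row
FIELDS = ["title", "critic_score", "audience_score", "review"]


def write_reviews_csv(rows, file_obj):
    """Write scraped rows (a list of dicts) to an open text file as CSV."""
    writer = csv.DictWriter(file_obj, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        # Missing keys become empty cells rather than raising an error
        writer.writerow({field: row.get(field, "") for field in FIELDS})


# Typical use with a real file:
# with open("reviews.csv", "w", newline="", encoding="utf-8") as f:
#     write_reviews_csv(rows, f)
```

Opening the file with `newline=""` is what the `csv` documentation recommends, so that line endings are not doubled on Windows.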


Roi Garrett

Member
12/17/2024 at 11:48 am

To enhance the scraper, you can add functionality to handle pagination. Audience reviews on RottenTomatoes are often split across multiple pages, and fetching all of them requires following the “Next” button. With Selenium, you can simulate clicking the button and scrape each additional page until no further pages remain. Adding delays between requests keeps the load on the site down and makes the scraper less likely to be flagged.
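Here is one way the pagination loop might be structured. Treat it as a sketch: `collect_paginated` and `selenium_callbacks` are made-up names, and the CSS selectors you pass in are placeholders you would find with the browser's developer tools. The `max_pages` cap is my own safety assumption in case the “Next” button never goes away.

```python
import time


def collect_paginated(read_page, go_to_next, max_pages=50, delay=2.0):
    """Gather items from successive pages.

    read_page()  -> list of items on the current page
    go_to_next() -> True if it moved to another page, False on the last one
    max_pages caps the loop in case the "Next" control never disappears.
    """
    items = []
    for _ in range(max_pages):
        items.extend(read_page())
        if not go_to_next():
            break
        time.sleep(delay)  # give the next page time to load and pace requests
    return items


def selenium_callbacks(driver, review_css, next_css):
    """Build the two callbacks for a Selenium driver.

    review_css and next_css are placeholder selectors you supply after
    inspecting the page; they are not taken from the real site.
    """
    from selenium.webdriver.common.by import By

    def read_page():
        elements = driver.find_elements(By.CSS_SELECTOR, review_css)
        return [element.text.strip() for element in elements]

    def go_to_next():
        buttons = driver.find_elements(By.CSS_SELECTOR, next_css)
        if not buttons or not buttons[0].is_displayed() or not buttons[0].is_enabled():
            return False  # no usable "Next" button, so this is the last page
        buttons[0].click()
        return True

    return read_page, go_to_next
```

With a driver already on the reviews page, usage would be `read_page, go_to_next = selenium_callbacks(driver, "<review selector>", "<next button selector>")` followed by `collect_paginated(read_page, go_to_next)`.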

Lisbet Verica

Member
12/21/2024 at 10:25 am

Using proxy rotation is essential for scraping large datasets without being blocked. RottenTomatoes may restrict access if it detects repeated requests from the same IP address. Integrating a proxy service into your scraper allows you to distribute requests across multiple IPs, reducing the likelihood of detection. Combining this with randomized headers further helps evade detection mechanisms.
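A minimal sketch of the rotation idea with requests follows. The proxy addresses and User-Agent strings are placeholders (swap in whatever your proxy provider gives you), and `next_request_options` is a made-up helper name. It cycles through proxies in order and picks a User-Agent at random for each request.

```python
import itertools
import random

# Placeholder values: substitute the endpoints your proxy provider gives you.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

_proxy_cycle = itertools.cycle(PROXIES)


def next_request_options():
    """Return keyword arguments for requests.get: the next proxy in
    round-robin order plus a randomly chosen User-Agent."""
    proxy = next(_proxy_cycle)
    return {
        "proxies": {"http": proxy, "https": proxy},
        "headers": {"User-Agent": random.choice(USER_AGENTS)},
        "timeout": 10,
    }
```

Each request then becomes `requests.get(url, **next_request_options())`, so consecutive calls go out through different proxies.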

Julia Karthika

Member
12/21/2024 at 11:01 am

Storing the scraped data in a database like MongoDB or MySQL allows for efficient data management. For example, you can run queries to find the most-reviewed movies or calculate the average audience score for a specific genre. Structured storage also makes it easier to visualize the data using tools like Tableau or Matplotlib, providing deeper insights into movie trends.
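To make the two queries concrete, here is a small sketch using SQLite from the standard library as a stand-in for MongoDB or MySQL. The table layout, column names, and function names are all assumptions for illustration, and the `genre` column presumes you scraped a genre for each movie.

```python
import sqlite3


def build_db(movies, reviews):
    """Load scraped rows into an in-memory SQLite database.

    movies:  (id, title, genre, audience_score) tuples
    reviews: (movie_id, body) tuples
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT, genre TEXT, audience_score REAL)"
    )
    conn.execute("CREATE TABLE reviews (movie_id INTEGER REFERENCES movies(id), body TEXT)")
    conn.executemany("INSERT INTO movies VALUES (?, ?, ?, ?)", movies)
    conn.executemany("INSERT INTO reviews VALUES (?, ?)", reviews)
    conn.commit()
    return conn


def most_reviewed(conn, limit=5):
    """Return (title, review_count) pairs, most-reviewed first."""
    return conn.execute(
        """SELECT m.title, COUNT(r.movie_id) AS n
           FROM movies m LEFT JOIN reviews r ON r.movie_id = m.id
           GROUP BY m.id ORDER BY n DESC, m.title LIMIT ?""",
        (limit,),
    ).fetchall()


def average_score_by_genre(conn, genre):
    """Return the mean audience score for a genre, or None if it has no movies."""
    row = conn.execute(
        "SELECT AVG(audience_score) FROM movies WHERE genre = ?", (genre,)
    ).fetchone()
    return row[0]
```

The same `GROUP BY` and `AVG` queries carry over to MySQL more or less unchanged; MongoDB would express them as an aggregation pipeline instead.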

Soma Danilo

Member
12/21/2024 at 11:11 am

Handling missing data or unexpected changes in the site’s structure is crucial for building a robust scraper. Websites like RottenTomatoes frequently update their layouts, which can break hardcoded scripts. Using flexible selectors that rely on attributes or patterns rather than static class names can mitigate this issue. Regularly testing the scraper and logging its results helps you catch breakage early, before it silently corrupts your data.
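One possible shape for this is sketched below: try a list of selectors in order, and log a warning instead of crashing when none match. The selectors in `TITLE_SELECTORS` are hypothetical (an attribute match, a class-name pattern, and a plain-tag fallback) and are not confirmed against the live site. `first_text` is a made-up helper name.

```python
import logging

from bs4 import BeautifulSoup

logger = logging.getLogger("scraper")


def first_text(soup, selectors, field):
    """Try each CSS selector in order and return the first non-empty text.

    Returns None (and logs a warning) when nothing matches, so a layout
    change shows up in the logs instead of crashing the run.
    """
    for css in selectors:
        element = soup.select_one(css)
        if element and element.get_text(strip=True):
            return element.get_text(strip=True)
    logger.warning("No selector matched for %s: %s", field, selectors)
    return None


# Hypothetical selectors: an attribute match, a class-name pattern,
# and a structural fallback. None are confirmed against the live site.
TITLE_SELECTORS = ['[data-qa="movie-title"]', 'h1[class*="title"]', "h1"]

# Example with inline HTML standing in for a fetched page
sample = BeautifulSoup('<h1 class="scoreboard__title"> Example Movie </h1>', "html.parser")
title = first_text(sample, TITLE_SELECTORS, "title")
```

Because the pattern selector only needs "title" somewhere in the class name, a rename from one title class to another similar one would still match, and a total miss leaves a warning in the log that you can alert on.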
