Compare Python and Ruby for scraping product reviews on Tiki Vietnam

Posted by Aretha Melech on 12/14/2024 at 8:37 am

How does scraping product reviews from Tiki, one of Vietnam’s largest e-commerce platforms, differ between Python and Ruby? Would Python’s BeautifulSoup library be more efficient for parsing static HTML, or does Ruby’s Nokogiri offer a simpler and more elegant solution? How do both languages handle dynamic content, such as paginated reviews or JavaScript-rendered elements?
Below are two implementations, one in Python and one in Ruby, for scraping product reviews from a Tiki product page. Which approach better handles the site’s structure and ensures accurate data extraction?

Python Implementation:

import requests
from bs4 import BeautifulSoup
# URL of the Tiki product page
url = "https://tiki.vn/product-page"
# Headers to mimic a browser request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
# Fetch the page content; the timeout keeps the request from hanging indefinitely
response = requests.get(url, headers=headers, timeout=10)
if response.status_code == 200:
    soup = BeautifulSoup(response.content, "html.parser")
    # Extract reviews
    reviews = soup.find_all("div", class_="review-item")
    for idx, review in enumerate(reviews, 1):
        # Look up each element once and fall back to a default when it is missing
        name_tag = review.find("span", class_="reviewer-name")
        text_tag = review.find("p", class_="review-text")
        reviewer = name_tag.text.strip() if name_tag else "Anonymous"
        comment = text_tag.text.strip() if text_tag else "No comment"
        print(f"Review {idx}: {reviewer} - {comment}")
else:
    print(f"Failed to fetch the page. Status code: {response.status_code}")

Ruby Implementation:

require 'nokogiri'
require 'open-uri'
# URL of the Tiki product page
url = 'https://tiki.vn/product-page'
# Fetch the page content
doc = Nokogiri::HTML(URI.open(url))
# Scrape reviews
reviews = doc.css('.review-item')
if reviews.any?
  reviews.each_with_index do |review, index|
    reviewer = review.at_css('.reviewer-name')&.text&.strip || 'Anonymous'
    comment = review.at_css('.review-text')&.text&.strip || 'No comment'
    puts "Review #{index + 1}: #{reviewer} - #{comment}"
  end
else
  puts "No reviews found."
end

4 Replies

Shakti Siria

Member
12/18/2024 at 10:27 am

Python’s BeautifulSoup is efficient for parsing static HTML, making it a great choice for smaller tasks. However, it only parses the HTML it is given and cannot execute JavaScript, so dynamic content requires pairing it with a tool like Selenium that renders the page first.
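
A minimal sketch of that combination, assuming the review markup uses the same class names as the original post (`review-item`, `reviewer-name`, `review-text`), which may not match Tiki's real page, and that Chrome with a matching driver is installed. The function names `fetch_rendered_html` and `parse_reviews` are illustrative only.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, wait_selector=".review-item", timeout=10):
    """Load the page in headless Chrome and return the HTML after JavaScript has run."""
    # Selenium is imported here so parse_reviews below also works without a browser.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until at least one review element exists (the selector is an assumption).
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_selector))
        )
        return driver.page_source
    finally:
        driver.quit()


def parse_reviews(html):
    """Extract reviewer/comment pairs from already-rendered HTML."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for item in soup.select(".review-item"):
        name = item.select_one(".reviewer-name")
        text = item.select_one(".review-text")
        results.append({
            "reviewer": name.get_text(strip=True) if name else "Anonymous",
            "comment": text.get_text(strip=True) if text else "No comment",
        })
    return results
```

With a browser available, the two pieces would be chained as `parse_reviews(fetch_rendered_html(url))`, where `url` is the placeholder product URL from the original post.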

Lilla Roma

Member
12/21/2024 at 6:00 am

Ruby’s Nokogiri is simple and intuitive for static content scraping, but like Python, it requires additional libraries or tools, such as Watir, to handle JavaScript-heavy pages or dynamic content.

Rayan Todorka

Member
12/21/2024 at 6:34 am

Both Python and Ruby would require enhancements for paginated reviews. By following the “Next Page” link on each page until none remains, the scripts could collect reviews across multiple pages for a more comprehensive dataset.
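
A rough sketch of that loop in Python, assuming the next-page control is a plain link matching the hypothetical selector `a.next-page` and that reviews appear in the static HTML. The `fetch` argument is any function that takes a URL and returns HTML, so the paging logic can be tried without touching the network.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def scrape_all_pages(start_url, fetch, max_pages=50):
    """Follow next-page links from start_url and collect review comments."""
    comments = []
    seen = set()
    url = start_url
    # Stop when there is no next link, a page repeats, or the page cap is reached.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        soup = BeautifulSoup(fetch(url), "html.parser")
        for tag in soup.select(".review-item .review-text"):
            comments.append(tag.get_text(strip=True))
        # "a.next-page" is an assumed selector, not taken from the real site.
        next_link = soup.select_one("a.next-page[href]")
        url = urljoin(url, next_link["href"]) if next_link else None
    return comments
```

For a live run, `fetch` could be something like `lambda u: requests.get(u, headers=headers, timeout=10).text`. If the reviews are loaded by JavaScript, the static HTML would contain no next-page link and this loop would stop after the first page.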

Margery Roxana

Member
12/21/2024 at 6:52 am

For large-scale scraping, Python offers better scalability due to its rich ecosystem of libraries and frameworks. Ruby, while powerful, may require more manual effort for handling advanced scraping tasks involving concurrency.
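
As one illustration of the concurrency point, here is a small sketch that uses only Python's standard library thread pool to fetch several pages in parallel. `fetch_many` is a hypothetical helper name, and `fetch` stands for whatever function downloads a single URL.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_many(urls, fetch, max_workers=8):
    """Run fetch(url) for every URL in a thread pool.

    Returns (results, errors): two dicts keyed by URL, so one failed
    page does not discard the pages that succeeded.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                errors[url] = exc
    return results, errors
```

Threads suit this workload because the time is spent waiting on the network. Keeping `max_workers` small also limits the load placed on the target site.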