How can I scrape product details from JD Central Thailand using Python and Scrapy?

Posted by Khordad Leto on 12/11/2024 at 11:09 am
Scrapy can pull product details such as price, availability, and category from JD Central Thailand, a major e-commerce platform. Start by inspecting the product page’s HTML structure, since JD Central often uses complex layouts with product listings nested inside specific div tags. Scrapy’s XPath or CSS selectors can then extract these details. Some pages load data via AJAX, so check that the data you need is actually present in the response Scrapy receives before relying on your selectors. Once you have the data, you can store it in a database or a CSV file.
```
import scrapy


class JDProductScraper(scrapy.Spider):
    name = 'jd_product_scraper'
    start_urls = ['https://www.jd.co.th/th/products']

    def parse(self, response):
        # Each product sits in its own div.product-item container.
        for product in response.css('div.product-item'):
            # default='' keeps .strip() from failing when a selector matches nothing.
            title = product.css('div.product-name::text').get(default='')
            price = product.css('span.product-price::text').get(default='')
            availability = product.css('span.availability-status::text').get(default='')
            yield {
                'title': title.strip(),
                'price': price.strip(),
                'availability': availability.strip(),
            }

        # Follow the "next page" link until there is none left.
        next_page = response.css('a.pagination-next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
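The storage step mentioned above could look something like the following. This is only a sketch: the class name `SQLitePipeline`, the `products.db` file name, and the table layout are assumptions made for illustration, not anything JD Central or Scrapy prescribes. It relies on the fact that a Scrapy item pipeline is a plain class with `open_spider`, `process_item`, and `close_spider` methods.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical item pipeline that writes each scraped item to SQLite."""

    def __init__(self, db_path="products.db"):
        # db_path is an assumed default; point it wherever you like.
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the DB and create the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS products "
            "(title TEXT, price TEXT, availability TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every yielded item; parameters guard against SQL injection.
        self.conn.execute(
            "INSERT INTO products (title, price, availability) VALUES (?, ?, ?)",
            (item.get("title"), item.get("price"), item.get("availability")),
        )
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: flush and close.
        self.conn.commit()
        self.conn.close()
```

To try it, you would register the class under `ITEM_PIPELINES` in settings.py, for example `{"myproject.pipelines.SQLitePipeline": 300}`, where `myproject` is a placeholder for your own project name. If a CSV file is enough, no pipeline is needed: `scrapy crawl jd_product_scraper -o products.csv` exports the yielded items directly.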
3 Replies

Osman Devaki

Member
12/11/2024 at 12:09 pm
One challenge of scraping Lazada Thailand is handling rich media content such as images, JavaScript-loaded product details, and other dynamic elements. Static data can be scraped easily with BeautifulSoup, but if the site relies on JavaScript to load its products, you’ll need to find the API calls the site makes to fetch that data. Calling those endpoints directly avoids loading the entire page and is much faster. For the static case, here’s an example that extracts product titles and prices directly from the page’s HTML.
```
import requests
from bs4 import BeautifulSoup

url = 'https://www.lazada.co.th/catalog/?q=mobile'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx responses

soup = BeautifulSoup(response.text, 'html.parser')
# Class names like c2prKC look auto-generated, so they may change over time.
for product in soup.find_all('div', class_='c2prKC'):
    title = product.find('div', class_='c16H9d')
    price = product.find('span', class_='c13VH6')
    if title is None or price is None:
        continue  # skip cards that are missing either field
    print(f'Product: {title.get_text(strip=True)}, Price: {price.get_text(strip=True)}')
```
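For the API route described in this reply, a rough sketch might look like the one below. Everything specific in it is an assumption: the `API_URL` endpoint, the `q` and `page` parameters, and the `items` / `name` / `price` JSON keys are placeholders for whatever the browser's Network tab actually shows. It uses only the standard library so it runs without extra installs; `requests.get(...).json()` would do the same job.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint: replace with the real XHR URL seen in the Network tab.
API_URL = "https://example.com/api/search"


def parse_products(payload):
    """Pull title/price pairs out of a decoded JSON payload (assumed layout)."""
    products = []
    for entry in payload.get("items", []):
        products.append({"title": entry.get("name"), "price": entry.get("price")})
    return products


def fetch_products(query, page=1):
    """Request one page of results from the (assumed) JSON endpoint."""
    url = f"{API_URL}?{urlencode({'q': query, 'page': page})}"
    request = Request(
        url,
        headers={"User-Agent": "Mozilla/5.0", "Accept": "application/json"},
    )
    with urlopen(request, timeout=10) as response:
        return parse_products(json.load(response))
```

Keeping the parsing in `parse_products` separate from the network call means the parsing logic can be checked against a saved JSON response before hitting the live site.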

Lalitha Kreka

Member

12/12/2024 at 8:09 am

To scrape product details from JD Central Thailand, you’ll need to target the HTML elements containing product names, prices, and availability. JD Central’s product listings typically contain structured data inside specific class attributes. Scrapy makes it easy to extract this data using CSS selectors. Handling pagination correctly is essential to scrape products across multiple pages, which can be done using Scrapy’s response.follow() method to navigate through links.

```
import scrapy


class JDSpider(scrapy.Spider):
    name = 'jd_spider'
    start_urls = ['https://www.jd.co.th/th/search?query=laptop']

    def parse(self, response):
        for product in response.xpath('//div[@class="product-item"]'):
            # default='' keeps .strip() from failing when a node is missing.
            title = product.xpath('.//div[@class="product-name"]/text()').get(default='')
            price = product.xpath('.//span[@class="product-price"]/text()').get(default='')
            availability = product.xpath('.//span[@class="product-availability"]/text()').get(default='')
            yield {
                'title': title.strip(),
                'price': price.strip(),
                'availability': availability.strip(),
            }

        # response.follow() accepts relative links, so the href can be passed as-is.
        next_page = response.xpath('//a[@class="next"]/@href').get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
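To illustrate why `response.follow()` can take the raw `href`, here is a small standard-library sketch of how a relative pagination link is resolved against the current page URL. Scrapy does roughly this internally; the `page=2` parameter and the helper name `resolve_next_page` are made up for the example and may not match JD Central's real pagination.

```python
from urllib.parse import urljoin


def resolve_next_page(page_url, href):
    """Return the absolute URL for a pagination link found on page_url."""
    return urljoin(page_url, href)


current = "https://www.jd.co.th/th/search?query=laptop"

# A root-relative link and a query-only link both resolve to the same page.
print(resolve_next_page(current, "/th/search?query=laptop&page=2"))
print(resolve_next_page(current, "?query=laptop&page=2"))
```

With plain `scrapy.Request` you would have to do this joining yourself, which is the main convenience `response.follow()` provides.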

Aston Martial

Member

12/12/2024 at 10:37 am

JD Central Thailand uses JavaScript and dynamic content loading for product information, so a plain Scrapy request may not return everything you see in the browser. Scrapy is excellent for scraping static HTML content, but for pages that require JavaScript rendering you may need to integrate it with a tool like Splash, which is enabled through Scrapy’s middleware settings. With XPath selectors you can then extract product data such as name, price, and availability from JD Central’s HTML structure.

```
import scrapy


class JDSpider(scrapy.Spider):
    name = 'jd_spider'
    start_urls = ['https://www.jd.co.th/th/products']

    def parse(self, response):
        for product in response.xpath('//div[@class="product-card"]'):
            # default='' keeps .strip() from failing when a node is missing.
            name = product.xpath('.//h3[@class="product-name"]/text()').get(default='')
            price = product.xpath('.//span[@class="price"]/text()').get(default='')
            availability = product.xpath('.//span[@class="availability-status"]/text()').get(default='')
            yield {
                'name': name.strip(),
                'price': price.strip(),
                'availability': availability.strip(),
            }

        # Keep following the "next page" link while one exists.
        next_page = response.xpath('//a[@class="next-page"]/@href').get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
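For the Splash integration mentioned in this reply, the settings.py fragment below follows the setup described in the scrapy-splash README. It is a sketch under two assumptions: that the `scrapy-splash` package is installed, and that a Splash instance is listening on `localhost:8050` (adjust `SPLASH_URL` if yours runs elsewhere).

```python
# settings.py fragment for scrapy-splash (assumes Splash runs locally on port 8050)
SPLASH_URL = "http://localhost:8050"

# Route requests through Splash and keep cookies and compression working.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Avoids storing duplicate Splash arguments with every scheduled request.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Splash-aware duplicate filter and HTTP cache storage.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```

With these in place, the spider would yield `SplashRequest(url, self.parse, args={'wait': 2})` from `scrapy_splash` instead of relying on `start_urls`, so that `parse` receives the rendered HTML. The two-second wait is an arbitrary starting point, not a tuned value.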
