Use Node.js to scrape product titles from Books.com.tw

Jochem Gunvor · 2024-12-14T06:51:17+00:00


Posted by Jochem Gunvor on 12/14/2024 at 6:51 am
How would you scrape product titles from Books.com.tw, one of Taiwan’s most popular online bookstores, considering that the site is written in Chinese? Does the presence of Chinese characters in the webpage content or attributes require additional handling? Would UTF-8 encoding be sufficient to ensure that the characters are parsed and displayed correctly, or is further processing needed?
Scraping a site like Books.com.tw requires attention to character encoding and to the structure of the page. Product titles are typically displayed near the top of the product page and are often wrapped in a specific HTML tag, such as h1 or span. With Node.js and Puppeteer, the script can wait until dynamically loaded content has rendered before scraping. The following implementation shows how to scrape a product title while handling Chinese characters correctly:
```
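const puppeteer = require('puppeteer');

// Placeholders: substitute a real product URL, and check the selector
// against the live page in DevTools before relying on it.
const PRODUCT_URL = 'https://www.books.com.tw/products/product-id';
const TITLE_SELECTOR = '.product-title';

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    try {
        const page = await browser.newPage();
        // Navigate to the product page and wait for network activity to settle
        await page.goto(PRODUCT_URL, { waitUntil: 'networkidle2' });
        // Wait for the product title to appear in the DOM
        await page.waitForSelector(TITLE_SELECTOR);
        // Extract the title text inside the page context
        const productTitle = await page.evaluate((selector) => {
            const titleElement = document.querySelector(selector);
            return titleElement ? titleElement.innerText.trim() : 'Title not found';
        }, TITLE_SELECTOR);
        console.log('Product Title:', productTitle);
    } finally {
        // Close the browser even if navigation or the selector wait fails
        await browser.close();
    }
})().catch((error) => {
    console.error('Scrape failed:', error.message);
    process.exitCode = 1;
});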
```
4 Replies

Mildburg Beth

Member
12/17/2024 at 9:58 am

If you fetch the HTML yourself with a plain HTTP client, confirm which encoding the server declares (ideally UTF-8) and decode the response bytes accordingly. Puppeteer avoids most of this work because it drives a real browser, which decodes the page using its declared charset and returns text to Node.js as ordinary Unicode strings.
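For anyone going the plain HTTP route instead of Puppeteer, here is a minimal sketch of decoding the response bytes by the charset the server declares. The helper names (`charsetFromContentType`, `decodeBody`, `fetchHtml`) are my own, and I am assuming Node 18+ for the built-in `fetch`. I have not checked which charset Books.com.tw actually sends, so the code reads it from the `Content-Type` header and only falls back to UTF-8 when none is declared.

```javascript
// Pick the charset from a Content-Type header, defaulting to UTF-8.
function charsetFromContentType(contentType) {
    const match = /charset=\s*"?([^;"\s]+)/i.exec(contentType || '');
    return match ? match[1].toLowerCase() : 'utf-8';
}

// Decode raw response bytes into a JavaScript string.
// TextDecoder throws a RangeError if the label is not a known encoding.
function decodeBody(bytes, contentType) {
    const decoder = new TextDecoder(charsetFromContentType(contentType));
    return decoder.decode(bytes);
}

// Hypothetical usage: fetch a page and decode it by its declared charset.
async function fetchHtml(url) {
    const response = await fetch(url);
    const bytes = new Uint8Array(await response.arrayBuffer());
    return decodeBody(bytes, response.headers.get('content-type'));
}
```

Once the body is a JavaScript string, Chinese characters need no further processing. Problems usually come from decoding the bytes with the wrong charset in the first place.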
Nora Ramzan

Member
12/18/2024 at 8:41 am

If the product title element is dynamically loaded, Puppeteer is well-suited for the task. However, inspecting the network requests for API endpoints could reveal a direct way to fetch product data without rendering the entire page.
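One way to look for such endpoints without leaving Puppeteer is to listen for responses while the page loads and note which ones look like JSON. This is only a sketch: `looksLikeJsonApi` and `collectApiUrls` are names I made up, and whether Books.com.tw exposes any usable JSON endpoint at all is an assumption you would need to verify in the Network tab.

```javascript
// Decide whether a response looks like a JSON API call worth inspecting.
function looksLikeJsonApi(resourceType, contentType) {
    const isAjax = resourceType === 'xhr' || resourceType === 'fetch';
    return isAjax && /json/i.test(contentType || '');
}

// Attach a listener to a Puppeteer page that records candidate API URLs.
// Call this before page.goto() so that early requests are not missed.
function collectApiUrls(page) {
    const urls = [];
    page.on('response', (response) => {
        const resourceType = response.request().resourceType();
        const contentType = response.headers()['content-type'];
        if (looksLikeJsonApi(resourceType, contentType)) {
            urls.push(response.url());
        }
    });
    return urls;
}
```

After `page.goto(...)` resolves, the returned array holds the URLs that matched. If one of them returns the product data, you could request it directly and skip rendering the page.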
Segundo Jayme

Member
12/19/2024 at 11:42 am

Adding error handling for missing elements or incorrect selectors would improve the script. For example, logging pages with unexpected structures can help refine the scraper for broader usage across Books.com.tw.
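A rough sketch of what that could look like: wrap the wait and the extraction in one helper that never throws and instead returns a small result object you can log. The helper name `extractTitle`, the result shape, and the 10-second default timeout are all my own choices, not anything from the original script.

```javascript
// Try to read a title from an already-navigated Puppeteer page.
// Returns { url, ok: true, title } or { url, ok: false, reason }.
async function extractTitle(page, url, selector, timeoutMs = 10000) {
    try {
        // Rejects with a timeout error if the selector never appears
        await page.waitForSelector(selector, { timeout: timeoutMs });
        const title = await page.$eval(selector, (el) => el.textContent.trim());
        if (!title) {
            return { url, ok: false, reason: 'empty title' };
        }
        return { url, ok: true, title };
    } catch (error) {
        // Covers missing elements, wrong selectors, and detached pages
        return { url, ok: false, reason: error.message };
    }
}
```

Results with `ok: false` can be collected in a separate list, which gives you the set of pages whose structure differs from what the selector expects.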
Ricardo Urbain

Member
12/21/2024 at 5:24 am

Storing the scraped product titles in a structured format like JSON or CSV would facilitate further analysis. Including additional fields, such as product IDs or categories, would make the dataset more comprehensive.
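As a starting point, something like the following would write both formats. The field names (`productId`, `title`, `category`) and the helper names are illustrative, not taken from the site. The CSV is prefixed with a UTF-8 byte order mark on the assumption that the file may be opened in Excel, which tends to misread Chinese text in CSV files without it. Drop the BOM if your downstream tools do not want it.

```javascript
const fs = require('node:fs');

// Quote a CSV field only when it contains a comma, quote, or line break.
function csvField(value) {
    const text = String(value ?? '');
    return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Build CSV text from an array of records and an ordered column list.
function toCsv(records, columns) {
    const header = columns.join(',');
    const rows = records.map((record) =>
        columns.map((column) => csvField(record[column])).join(',')
    );
    return [header, ...rows].join('\n') + '\n';
}

// Write the same records as <basePath>.json and <basePath>.csv in UTF-8.
function saveRecords(records, basePath) {
    const columns = ['productId', 'title', 'category'];
    fs.writeFileSync(`${basePath}.json`, JSON.stringify(records, null, 2), 'utf8');
    fs.writeFileSync(`${basePath}.csv`, '\uFEFF' + toCsv(records, columns), 'utf8');
}
```

Calling `saveRecords(records, 'books')` would produce `books.json` and `books.csv` in the working directory.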
