
  • Use Node.js to scrape product titles from Books.com.tw

    Posted by Jochem Gunvor on 12/14/2024 at 6:51 am

    How would you scrape product titles from Books.com.tw, one of Taiwan’s most popular online bookstores, considering that the site is written in Chinese? Does the presence of Chinese characters in the webpage content or attributes require additional handling? Would UTF-8 encoding be sufficient to ensure that the characters are parsed and displayed correctly, or is further processing needed?
    Scraping Books.com.tw requires attention to both character encoding and page structure. Product titles usually appear near the top of the product page, typically in an h1 or span element. Puppeteer, running under Node.js, lets the page render its dynamically loaded content before the script reads it, and because it returns text as JavaScript strings, Chinese characters come through intact. The following script shows the approach; the product URL and the .product-title selector are placeholders to replace with values confirmed in the browser's developer tools:

    const puppeteer = require('puppeteer');

    (async () => {
        const browser = await puppeteer.launch({ headless: true });
        try {
            const page = await browser.newPage();
            // Navigate to a product page ('product-id' is a placeholder for a real ID)
            await page.goto('https://www.books.com.tw/products/product-id', { waitUntil: 'networkidle2' });
            // Wait for the title element; '.product-title' is an assumed selector
            await page.waitForSelector('.product-title');
            // Extract the title text inside the page context
            const productTitle = await page.evaluate(() => {
                const titleElement = document.querySelector('.product-title');
                return titleElement ? titleElement.innerText.trim() : 'Title not found';
            });
            console.log('Product Title:', productTitle);
        } catch (error) {
            console.error('Failed to scrape the product title:', error.message);
        } finally {
            // Always release the browser, even if navigation or the selector wait fails
            await browser.close();
        }
    })();
    
  • 4 Replies
  • Mildburg Beth

    Member
    12/17/2024 at 9:58 am

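    Chinese characters only need special handling if you fetch the raw HTML yourself; in that case, confirm which charset the response declares (UTF-8 or otherwise) and decode the bytes accordingly. Puppeteer largely avoids the problem because it drives a real browser, which decodes the page according to its declared charset and hands text back to Node.js as ordinary Unicode strings.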
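
    For the case where the HTML is fetched without a browser, here is a minimal sketch of charset-aware decoding. The helper name decodeBody is hypothetical, and the charset Books.com.tw actually declares has not been checked here; the code simply trusts the Content-Type header and falls back to UTF-8.

```javascript
// Hypothetical helper: decode a raw HTTP body using the charset named in
// the Content-Type header, defaulting to UTF-8 when none is given.
function decodeBody(bytes, contentType = '') {
  const match = /charset\s*=\s*"?([^";\s]+)/i.exec(contentType);
  const label = match ? match[1].toLowerCase() : 'utf-8';
  try {
    return new TextDecoder(label).decode(bytes);
  } catch (error) {
    // Unknown or unsupported charset label: fall back to UTF-8.
    return new TextDecoder('utf-8').decode(bytes);
  }
}

// Usage with an in-memory sample instead of a real response body.
const sampleBytes = Buffer.from('博客來', 'utf8');
console.log(decodeBody(sampleBytes, 'text/html; charset=UTF-8'));
```

    With an HTTP client, the bytes would come from the response body and the second argument from the response's Content-Type header.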

  • Nora Ramzan

    Member
    12/18/2024 at 8:41 am

    If the product title element is dynamically loaded, Puppeteer is well-suited for the task. However, inspecting the network requests for API endpoints could reveal a direct way to fetch product data without rendering the entire page.
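
    One way to look for such endpoints is to listen to the responses Puppeteer sees while the page loads. This is only a sketch: collectJsonEndpoints is a hypothetical helper, and whether Books.com.tw exposes any JSON endpoints at all is an assumption to verify.

```javascript
// Hypothetical helper: records the URLs of XHR/fetch responses that return JSON.
// Attach it before calling page.goto() so that early requests are captured.
function collectJsonEndpoints(page) {
  const endpoints = new Set();
  page.on('response', (response) => {
    const type = response.request().resourceType();
    const contentType = response.headers()['content-type'] || '';
    if ((type === 'xhr' || type === 'fetch') && contentType.includes('json')) {
      endpoints.add(response.url());
    }
  });
  return endpoints;
}
```

    After navigation finishes, the returned Set can be printed and each URL inspected by hand to see whether it carries the title data.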

  • Segundo Jayme

    Member
    12/19/2024 at 11:42 am

    Adding error handling for missing elements or incorrect selectors would improve the script. For example, logging pages with unexpected structures can help refine the scraper for broader usage across Books.com.tw.
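
    A rough sketch of what that could look like, assuming the same placeholder .product-title selector as the original script. The helper name scrapeTitle and the 10-second timeout are illustrative choices, not anything specific to the site.

```javascript
// Hypothetical helper: returns a result object instead of throwing, so a
// batch run can continue and the failed URLs can be reviewed afterwards.
async function scrapeTitle(page, url, selector = '.product-title', timeoutMs = 10000) {
  try {
    await page.goto(url, { waitUntil: 'networkidle2' });
    // Rejects with a timeout error if the selector never appears.
    await page.waitForSelector(selector, { timeout: timeoutMs });
    const title = await page.$eval(selector, (el) => el.textContent.trim());
    return { url, title, error: null };
  } catch (error) {
    console.warn(`Unexpected page structure or load failure at ${url}: ${error.message}`);
    return { url, title: null, error: error.message };
  }
}
```

    Results with a non-null error field point to pages whose structure differs from what the selector expects.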

  • Ricardo Urbain

    Member
    12/21/2024 at 5:24 am

    Storing the scraped product titles in a structured format like JSON or CSV would facilitate further analysis. Including additional fields, such as product IDs or categories, would make the dataset more comprehensive.
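
    As a minimal sketch of that idea, the helpers below write the same records to both JSON and CSV. The field names (id, title, category) and the function names are assumptions chosen for illustration.

```javascript
const fs = require('node:fs');

// Quote a CSV field only when it contains a comma, quote, or line break.
function csvField(value) {
  const text = String(value ?? '');
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Build CSV text with a header row from an array of plain objects.
function toCsv(records, columns) {
  const rows = records.map((record) => columns.map((c) => csvField(record[c])).join(','));
  return [columns.join(','), ...rows].join('\n');
}

// Hypothetical helper: writes <basePath>.json and <basePath>.csv as UTF-8.
function saveRecords(records, basePath) {
  fs.writeFileSync(`${basePath}.json`, JSON.stringify(records, null, 2), 'utf8');
  fs.writeFileSync(`${basePath}.csv`, toCsv(records, ['id', 'title', 'category']), 'utf8');
}
```

    Calling saveRecords(records, 'titles') would produce titles.json and titles.csv in the working directory. Writing both files as UTF-8 keeps the Chinese titles intact, though some spreadsheet programs may need the encoding selected on import.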
