Use Node.js to scrape product titles from Books.com.tw
How would you scrape product titles from Books.com.tw, one of Taiwan’s most popular online bookstores, considering that the site is written in Chinese? Does the presence of Chinese characters in the webpage content or attributes require additional handling? Would UTF-8 encoding be sufficient to ensure that the characters are parsed and displayed correctly, or is further processing needed?
Scraping a site like Books.com.tw requires attention to two things: character encoding and the structure of the page. Product titles usually appear near the top of the product page, often inside an h1 or span element. Using Node.js with Puppeteer, the script can wait until dynamically loaded content has rendered before scraping. Because Puppeteer reads text from the browser's rendered DOM, the browser has already decoded the page, and the title arrives in Node.js as an ordinary Unicode string, so Chinese characters need no extra processing. The following implementation demonstrates how to scrape a product title this way:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Navigate to the Books.com.tw product page
  // ('product-id' is a placeholder; substitute a real product ID)
  await page.goto('https://www.books.com.tw/products/product-id', { waitUntil: 'networkidle2' });

  // Wait for the product title to load
  // (adjust '.product-title' to match the page's actual markup)
  await page.waitForSelector('.product-title');

  // Extract the product title from the rendered DOM
  const productTitle = await page.evaluate(() => {
    const titleElement = document.querySelector('.product-title');
    return titleElement ? titleElement.innerText.trim() : 'Title not found';
  });

  console.log('Product Title:', productTitle);

  await browser.close();
})();
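On the encoding question itself: explicit decoding only matters if you fetch the raw HTML bytes yourself instead of going through a browser. Here is a minimal sketch of that case, assuming the charset can be read from the Content-Type header. The helper names charsetFromContentType and decodeBody are hypothetical, and the sample bytes stand in for a real response body.

```javascript
// Hypothetical helper: read the charset label from a Content-Type header value.
// Falls back to 'utf-8' when no charset parameter is present.
function charsetFromContentType(contentType) {
  const match = /charset\s*=\s*"?([^";\s]+)/i.exec(contentType || '');
  return match ? match[1].toLowerCase() : 'utf-8';
}

// Hypothetical helper: decode raw response bytes using the declared charset.
// TextDecoder is a global in current Node.js versions.
function decodeBody(bytes, contentType) {
  const decoder = new TextDecoder(charsetFromContentType(contentType));
  return decoder.decode(bytes);
}

// Stand-in for bytes received from the network (not real Books.com.tw output).
const sampleBytes = Buffer.from('<h1>深入淺出 Node.js</h1>', 'utf8');
const html = decodeBody(sampleBytes, 'text/html; charset=UTF-8');

console.log(html);
```

If the declared charset matches the bytes, the Chinese text comes out intact and UTF-8 needs nothing further. If a page declared a legacy encoding such as Big5 instead, the same label would be passed to TextDecoder, though whether that label is supported depends on the ICU data included in your Node.js build.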