-
Goutam Victor replied to the discussion What’s the best approach to scraping PDF documents online? in the forum General Web Scraping a year ago
I sometimes convert PDFs to HTML before scraping, which allows for easier data extraction, especially with tabular data.
-
Goutam Victor replied to the discussion How do I deal with scraped data that has inconsistent formatting? in the forum General Web Scraping a year ago
Custom validation functions that flag outliers help keep the data consistent, especially for fields prone to user-input variation.
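A minimal sketch of what such validation functions could look like in Python. The field names (`price`, `date`), the accepted formats, and the sample records are assumptions made up for illustration, not anything from the original thread.

```python
import re
from datetime import datetime

def valid_price(value: str) -> bool:
    """True if value is a plain non-negative decimal such as '19.99'."""
    return re.fullmatch(r"\d+(\.\d{1,2})?", value.strip()) is not None

def valid_date(value: str) -> bool:
    """True if value parses as a real calendar date in YYYY-MM-DD form."""
    try:
        datetime.strptime(value.strip(), "%Y-%m-%d")
        return True
    except ValueError:
        return False

# Hypothetical mapping of field name to its validator.
VALIDATORS = {"price": valid_price, "date": valid_date}

def flag_invalid(records):
    """Return (row_index, field, value) for every value that fails validation."""
    flagged = []
    for i, record in enumerate(records):
        for field, check in VALIDATORS.items():
            value = record.get(field, "")
            if not check(value):
                flagged.append((i, field, value))
    return flagged

# Made-up sample rows: the second and third contain inconsistent values.
records = [
    {"price": "19.99", "date": "2024-01-15"},
    {"price": "$20", "date": "15/01/2024"},
    {"price": "7", "date": "2024-02-30"},
]
print(flag_invalid(records))
```

Flagging rather than silently fixing keeps the odd rows visible, so you can decide per field whether to reformat, re-scrape, or drop them.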
-
Goutam Victor replied to the discussion How do I scrape data from sites using custom fonts or icons? in the forum General Web Scraping a year ago
Selenium is useful when fonts or icons are loaded dynamically, since it lets me capture the content as the browser actually renders it.
-
Goutam Victor replied to the discussion What strategies can I use to scrape websites with limited search functionality? in the forum General Web Scraping a year ago
Scraping each letter of the alphabet or individual keywords separately is a last resort, but it’s effective on sites with poor search capabilities.
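A rough Python sketch of the letter-by-letter idea: build one search URL per letter, then merge the result batches and drop records that show up under more than one query. The endpoint `https://example.com/search`, the `q` parameter, and the `id` key are placeholders I made up; a real site will differ.

```python
import string
from urllib.parse import urlencode

BASE_URL = "https://example.com/search"  # hypothetical search endpoint

def build_query_urls(base_url=BASE_URL, param="q", terms=string.ascii_lowercase):
    """Build one search URL per term (by default, one per letter a-z)."""
    return [f"{base_url}?{urlencode({param: term})}" for term in terms]

def merge_results(batches, key="id"):
    """Merge result batches, keeping the first record seen for each key."""
    seen = {}
    for batch in batches:
        for record in batch:
            seen.setdefault(record[key], record)
    return list(seen.values())

urls = build_query_urls()
print(len(urls), urls[0])

# Made-up batches standing in for the parsed results of two queries.
batches = [
    [{"id": 1, "title": "Apple"}, {"id": 2, "title": "Avocado"}],
    [{"id": 2, "title": "Avocado"}, {"id": 3, "title": "Banana"}],
]
print(merge_results(batches))
```

Deduplicating on a stable key matters here because the same item usually matches several letters or keywords. Passing a list of keywords as `terms` works the same way.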
-
Goutam Victor started the discussion How can I use Node.js to scrape product reviews on Bol.com? in the forum General Web Scraping a year ago
Puppeteer in Node.js works well for navigating Bol.com’s review section, allowing you to load dynamic content and interact with pagination.
-
Placidus Virgee replied to the discussion How do I handle scraping pages with endless AJAX requests? in the forum General Web Scraping a year ago
Splash, which integrates with Scrapy through the scrapy-splash plugin, renders JavaScript, making it easier to handle pages that rely heavily on AJAX for content.
-
Placidus Virgee replied to the discussion What’s the best approach to scraping PDF documents online? in the forum General Web Scraping a year ago
If the PDFs follow a consistent structure, regex helps isolate data fields like names, dates, or amounts from the raw text.
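Here is a small Python sketch of that, assuming the PDF text has already been extracted to a string by some other tool. The labels (`Name:`, `Total:`), the date format, and the sample text are invented for illustration; the patterns would need adjusting to the actual layout.

```python
import re

# Hypothetical patterns for a document with "Name:" and "Total:" labels.
PATTERNS = {
    "name": re.compile(r"^Name:\s*(.+)$", re.MULTILINE),
    "date": re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
    "amount": re.compile(r"Total:\s*\$?([\d,]+\.\d{2})"),
}

def extract_fields(text):
    """Return the first match for each field, or None if it is missing."""
    result = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[field] = match.group(1).strip() if match else None
    return result

# Made-up raw text standing in for the output of a PDF text extractor.
sample = """Invoice
Name: Jane Doe
Issued: 2024-03-05
Total: $1,234.50
"""
print(extract_fields(sample))
```

Returning `None` for a missing field makes it easy to spot documents that do not follow the expected structure.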
-
Placidus Virgee replied to the discussion How do I deal with scraped data that has inconsistent formatting? in the forum General Web Scraping a year ago
I sometimes find it helpful to group similar data fields, such as phone numbers or names, for bulk formatting and error-checking.
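One way this grouping could look in Python: collect every value of a given field type, run a single formatter over the group, and record the values that fail. The field names, the 10-digit phone rule, and the sample records are assumptions for the sake of the example.

```python
import re
from collections import defaultdict

def normalize_phone(value):
    """Keep digits only; assumes 10-digit numbers, returns None otherwise."""
    digits = re.sub(r"\D", "", value)
    return digits if len(digits) == 10 else None

def normalize_name(value):
    """Collapse repeated whitespace and apply title case."""
    return " ".join(value.split()).title()

# Hypothetical mapping of field name to its bulk formatter.
FORMATTERS = {"phone": normalize_phone, "name": normalize_name}

def bulk_format(records):
    """Group values by field, format each group, and report failures."""
    grouped = defaultdict(list)
    for i, record in enumerate(records):
        for field, value in record.items():
            if field in FORMATTERS:
                grouped[field].append((i, value))

    cleaned = [dict(record) for record in records]  # leave the input untouched
    errors = []
    for field, items in grouped.items():
        formatter = FORMATTERS[field]
        for i, value in items:
            formatted = formatter(value)
            if formatted is None:
                errors.append((i, field, value))
            else:
                cleaned[i][field] = formatted
    return cleaned, errors

# Made-up sample records with inconsistent formatting.
records = [
    {"name": "  jane   DOE ", "phone": "(555) 123-4567"},
    {"name": "john smith", "phone": "12345"},
]
cleaned, errors = bulk_format(records)
print(cleaned)
print(errors)
```

Values that cannot be formatted are left as they were and listed in `errors`, so error-checking happens in the same pass as the formatting.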