

Ketut Hippolytos
Forum Replies Created
-
Ketut Hippolytos
Member
12/11/2024 at 11:01 am in reply to: How to fetch property data using Redfin API with Python?
For pages that detect Selenium, I use tools like undetected-chromedriver to bypass basic anti-bot measures and ensure smooth scraping.
-
Ketut Hippolytos
Member
12/11/2024 at 11:01 am in reply to: How to scrape job postings using Google Jobs API with Node.js?
I use rotating proxies and randomized headers to mimic real users. This approach helps avoid detection when scraping data from fingerprinting platforms.
-
Ketut Hippolytos
Member
03/18/2025 at 9:05 am in reply to: How to scrape search results using a DuckDuckGo proxy with JavaScript?
How to Scrape Search Results Using a DuckDuckGo Proxy with Java?
Web scraping is a crucial technique for extracting information from search engines, and DuckDuckGo is a popular choice due to its privacy-focused approach. By using a proxy, developers can bypass restrictions, prevent IP bans, and maintain anonymity while scraping search results. This guide will explore how to scrape DuckDuckGo search results using Java and a proxy. We will also cover database integration for storing the scraped data efficiently.
Why Scrape Search Results from DuckDuckGo?
DuckDuckGo is widely used for privacy-preserving searches, making it a valuable search engine for research, competitive analysis, and SEO monitoring. Unlike Google, DuckDuckGo does not track users, making it an attractive alternative for gathering unbiased search data.
Advantages of Scraping DuckDuckGo
Scraping DuckDuckGo offers several benefits, including:
- Privacy-Friendly Searches: Since DuckDuckGo does not track user queries, data collection is less likely to be biased.
- Unfiltered Search Results: The search results are not influenced by a user’s previous search history.
- Fewer Scraping Restrictions: DuckDuckGo has fewer scraping protections than Google, reducing the risk of getting blocked.
- Data Collection for SEO: Businesses can track keyword performance, analyze competitors, and optimize their SEO strategies.
- Academic Research: Researchers can gather data for linguistic studies, sentiment analysis, and trend monitoring.
Setting Up Java for Web Scraping
To scrape search results from DuckDuckGo using Java, we need a few dependencies:
- JSoup: A Java library for parsing and extracting data from HTML.
- Apache HttpClient: A library for making HTTP requests (optional here; the examples below use the JDK's built-in HttpURLConnection).
- Proxy Configuration: To route requests through a proxy for anonymity.
First, install the required dependencies using Maven:
```xml
<dependencies>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.15.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.httpcomponents.client5</groupId>
        <artifactId>httpclient5</artifactId>
        <version>5.2</version>
    </dependency>
</dependencies>
```
Using a Proxy for Scraping DuckDuckGo
To avoid detection and IP bans, we use a proxy server. DuckDuckGo does not impose heavy restrictions, but it is still best practice to scrape using a rotating proxy.
Configuring a Proxy in Java
We configure a proxy in Java using the JDK's built-in java.net.Proxy and HttpURLConnection classes:
```java
import java.io.IOException;
import java.net.*;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class DuckDuckGoScraper {
    public static void main(String[] args) {
        String searchQuery = "web scraping with Java";
        String proxyHost = "your.proxy.server";
        int proxyPort = 8080;

        try {
            // Route the request through the configured HTTP proxy.
            Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(proxyHost, proxyPort));
            URL url = new URL("https://duckduckgo.com/html/?q=" + URLEncoder.encode(searchQuery, "UTF-8"));

            HttpURLConnection connection = (HttpURLConnection) url.openConnection(proxy);
            connection.setRequestMethod("GET");
            connection.setRequestProperty("User-Agent", "Mozilla/5.0");

            // Parse the HTML response and print each result's title and URL.
            Document doc = Jsoup.parse(connection.getInputStream(), "UTF-8", url.toString());
            doc.select(".result__title a").forEach(element ->
                System.out.println("Title: " + element.text() + " | URL: " + element.attr("href")));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```
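The example above routes every request through a single fixed proxy. If you want the rotation mentioned earlier, one approach is to keep a small pool of proxy addresses and pick one at random for each request. The host names below are placeholders, so this is only a minimal sketch:

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: keeps a pool of proxy endpoints and hands back a
// randomly chosen one for each request. Replace the hosts with real proxies.
public class ProxyPool {
    private static final List<InetSocketAddress> PROXIES = List.of(
        new InetSocketAddress("proxy1.example.com", 8080),
        new InetSocketAddress("proxy2.example.com", 8080),
        new InetSocketAddress("proxy3.example.com", 8080)
    );

    public static Proxy nextProxy() {
        InetSocketAddress address = PROXIES.get(ThreadLocalRandom.current().nextInt(PROXIES.size()));
        return new Proxy(Proxy.Type.HTTP, address);
    }
}
```

The scraper would then call url.openConnection(ProxyPool.nextProxy()) instead of reusing a single Proxy instance.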
Extracting Search Results
The DuckDuckGoScraper class extracts search result titles and URLs from DuckDuckGo's HTML response. The results are identified using CSS selectors:
- `.result__title a`: Selects the search result titles and URLs.
- `.result__snippet`: Extracts the search result snippets.
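The scraper so far only prints titles and URLs. If you also want the snippets, you can walk each result block and read both children. The sketch below assumes DuckDuckGo's HTML endpoint still wraps each result in an element with the `result` class and still uses the selectors listed above:

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ResultPrinter {
    // Prints title, URL, and snippet for every result block. Assumes each
    // result is wrapped in an element with the "result" class and that the
    // .result__title and .result__snippet selectors still match.
    public static void printResults(Document doc) {
        for (Element result : doc.select(".result")) {
            Element link = result.selectFirst(".result__title a");
            Element snippet = result.selectFirst(".result__snippet");
            if (link == null) {
                continue; // skip ad blocks or results without a title link
            }
            System.out.println("Title:   " + link.text());
            System.out.println("URL:     " + link.attr("href"));
            System.out.println("Snippet: " + (snippet != null ? snippet.text() : ""));
        }
    }
}
```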
Storing Scraped Data in a Database
To store the extracted search results, we use a MySQL database. First, create a database and table:
```sql
CREATE DATABASE SearchResultsDB;
USE SearchResultsDB;

CREATE TABLE results (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255),
    url TEXT,
    snippet TEXT,
    timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```
Inserting Data into MySQL
We use JDBC to insert the scraped data into the database (this assumes a MySQL JDBC driver such as MySQL Connector/J is on the classpath):
```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class DatabaseHandler {
    private static final String DB_URL = "jdbc:mysql://localhost:3306/SearchResultsDB";
    private static final String USER = "root";
    private static final String PASSWORD = "password";

    public static void saveResults(Document doc) {
        String sql = "INSERT INTO results (title, url, snippet) VALUES (?, ?, ?)";

        try (Connection conn = DriverManager.getConnection(DB_URL, USER, PASSWORD);
             PreparedStatement stmt = conn.prepareStatement(sql)) {

            Elements results = doc.select(".result__title a");
            for (Element result : results) {
                String title = result.text();
                String url = result.attr("href");

                // The snippet lives in a sibling of the title block; guard
                // against results that have no snippet element.
                Element snippetElement = result.parent().nextElementSibling();
                String snippet = snippetElement != null ? snippetElement.text() : "";

                stmt.setString(1, title);
                stmt.setString(2, url);
                stmt.setString(3, snippet);
                stmt.executeUpdate();
            }
            System.out.println("Data successfully stored in the database.");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```
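To connect the two pieces, the scraper's entry point can hand the parsed Document straight to DatabaseHandler.saveResults. A minimal sketch, assuming both classes above are in the same project and the proxy placeholder is replaced with a real server:

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.net.URLEncoder;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ScrapeAndStore {
    public static void main(String[] args) throws Exception {
        String query = URLEncoder.encode("web scraping with Java", "UTF-8");
        URL url = new URL("https://duckduckgo.com/html/?q=" + query);

        // Same proxy configuration as the scraper example above.
        Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("your.proxy.server", 8080));
        HttpURLConnection connection = (HttpURLConnection) url.openConnection(proxy);
        connection.setRequestProperty("User-Agent", "Mozilla/5.0");

        try (InputStream in = connection.getInputStream()) {
            Document doc = Jsoup.parse(in, "UTF-8", url.toString());
            DatabaseHandler.saveResults(doc); // persist titles, URLs, and snippets
        }
    }
}
```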
Handling Anti-Scraping Mechanisms
Although DuckDuckGo is relatively lenient, websites often implement anti-scraping mechanisms. Here are some best practices to avoid detection:
- Use Rotating Proxies: Change IP addresses to avoid being blocked.
- Set User-Agent Headers: Mimic a real web browser.
- Introduce Delays: Add random delays between requests to avoid rate-limiting (see the sketch after this list).
- Use Headless Browsers: If JavaScript rendering is needed, tools like Selenium can be useful.
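As one way to implement the delay advice above, a small helper can sleep for a random interval before each request; the 2-6 second bounds below are arbitrary and should be tuned to the target:

```java
import java.util.concurrent.ThreadLocalRandom;

public class Throttle {
    // Sleeps for a random 2-6 seconds. Call this before each new search
    // request to reduce the chance of tripping rate limits; the bounds
    // here are only an example.
    public static void randomDelay() {
        long millis = ThreadLocalRandom.current().nextLong(2000, 6000);
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```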
Conclusion
Scraping search results from DuckDuckGo using Java and a proxy is a powerful technique for data collection. By leveraging JSoup for parsing, Apache HttpClient for requests, and MySQL for storage, we can efficiently extract and manage search engine data. Implementing best practices such as using rotating proxies and setting headers ensures a smooth scraping experience. Whether for SEO analysis, market research, or academic studies, web scraping from DuckDuckGo provides valuable insights while respecting user privacy.