Scraping Data from Any Website to Excel Using Java and PostgreSQL
In today’s data-driven world, the ability to extract and analyze data from various sources is crucial for businesses and researchers alike. Web scraping is a powerful technique that allows you to gather data from websites and store it in a structured format for further analysis. This article will guide you through the process of scraping data from any website and exporting it to Excel using Java and PostgreSQL. We will explore the tools and techniques required, provide code examples, and discuss best practices to ensure successful data extraction.

Understanding Web Scraping

Web scraping involves the automated extraction of data from websites. It is a valuable tool for collecting information from the web, such as product prices, user reviews, or news articles. However, it’s important to note that web scraping should be done ethically and in compliance with the website’s terms of service.
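One common courtesy check is the site's robots.txt file, which lists the paths crawlers are asked to avoid. As a minimal sketch, the helper below parses robots.txt content for Disallow rules (simplified: it ignores User-agent sections and treats all rules as global; the class and method names are illustrative, not from any library):

```java
import java.util.ArrayList;
import java.util.List;

public class RobotsRules {
    // Extracts the Disallow paths from robots.txt content.
    // Simplified: ignores User-agent sections and applies all rules globally.
    public static List<String> disallowedPaths(String robotsTxt) {
        List<String> paths = new ArrayList<>();
        for (String line : robotsTxt.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.toLowerCase().startsWith("disallow:")) {
                String path = trimmed.substring("disallow:".length()).trim();
                if (!path.isEmpty()) {
                    paths.add(path);
                }
            }
        }
        return paths;
    }

    // Returns false if the given URL path falls under any Disallow rule
    public static boolean isAllowed(String path, List<String> disallowed) {
        for (String rule : disallowed) {
            if (path.startsWith(rule)) {
                return false;
            }
        }
        return true;
    }
}
```

Before scraping a path, fetch the site's robots.txt once and skip any URL for which `isAllowed` returns false.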

There are several libraries and frameworks available for web scraping in Java, such as Jsoup and Selenium. These tools allow you to parse HTML, interact with web pages, and extract the desired data. In this article, we will focus on using Jsoup for its simplicity and efficiency.

Setting Up Your Environment

Before we begin, ensure that you have Java Development Kit (JDK) installed on your system. You will also need to set up a PostgreSQL database to store the scraped data. PostgreSQL is a powerful open-source relational database management system that is well-suited for handling large datasets.

To get started, download and install the latest version of PostgreSQL from the official website. Once installed, create a new database and a table to store the scraped data. Here is a sample SQL script to create a table:

CREATE TABLE scraped_data (
    id SERIAL PRIMARY KEY,
    title VARCHAR(255),
    price DECIMAL,
    url TEXT
);

Scraping Data with Java and Jsoup

Jsoup is a popular Java library for working with real-world HTML. It provides a convenient API for extracting and manipulating data, making it an excellent choice for web scraping. Let’s look at a simple example of how to use Jsoup to scrape data from a website.

First, add the Jsoup library to your project. You can do this by including the following dependency in your Maven `pom.xml` file:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version>
</dependency>
Next, create a Java class to perform the web scraping. In this example, we will scrape product titles and prices from an e-commerce website:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class WebScraper {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://example.com/products").get();
            Elements products = doc.select(".product");

            for (Element product : products) {
                String title = product.select(".product-title").text();
                String price = product.select(".product-price").text();
                System.out.println("Title: " + title + ", Price: " + price);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Storing Data in PostgreSQL

Once you have successfully scraped the data, the next step is to store it in your PostgreSQL database. You can use Java Database Connectivity (JDBC) to connect to the database and execute SQL queries. Ensure that you have the PostgreSQL JDBC driver added to your project.

Here is an example of how to insert the scraped data into the PostgreSQL database:

import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class DatabaseInserter {
    private static final String URL = "jdbc:postgresql://localhost:5432/your_database";
    private static final String USER = "your_username";
    private static final String PASSWORD = "your_password";

    public static void insertData(String title, String price, String url) {
        String sql = "INSERT INTO scraped_data (title, price, url) VALUES (?, ?, ?)";

        try (Connection conn = DriverManager.getConnection(URL, USER, PASSWORD);
             PreparedStatement pstmt = conn.prepareStatement(sql)) {

            pstmt.setString(1, title);
            pstmt.setBigDecimal(2, new BigDecimal(price));
            pstmt.setString(3, url);
            pstmt.executeUpdate();

        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}
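Note that `new BigDecimal(price)` will throw a `NumberFormatException` if the scraped price text still contains a currency symbol or thousands separators (for example "$1,299.99"). A minimal sketch of a normalizer you could call before inserting, assuming a dot as the decimal separator (the `PriceParser` class name is illustrative):

```java
import java.math.BigDecimal;

public class PriceParser {
    // Strips currency symbols and thousands separators so scraped price
    // text such as "$1,299.99" can be parsed as a BigDecimal.
    // Assumes a dot is the decimal separator.
    public static BigDecimal parse(String rawPrice) {
        String cleaned = rawPrice.replaceAll("[^0-9.]", "");
        if (cleaned.isEmpty()) {
            throw new IllegalArgumentException("No numeric price in: " + rawPrice);
        }
        return new BigDecimal(cleaned);
    }
}
```

With this in place, `insertData` can pass `PriceParser.parse(price)` instead of constructing the `BigDecimal` directly from the raw string.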

Exporting Data to Excel

After storing the data in PostgreSQL, you may want to export it to Excel for further analysis or reporting. Apache POI is a Java library that provides APIs for manipulating Microsoft Office documents, including Excel files.

To export data to Excel, add the Apache POI library to your project. You can include the following dependencies in your Maven `pom.xml` file:

<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>5.0.0</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>5.0.0</version>
</dependency>
Here is an example of how to export the data from PostgreSQL to an Excel file:

import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import java.io.FileOutputStream;
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ExcelExporter {
    private static final String URL = "jdbc:postgresql://localhost:5432/your_database";
    private static final String USER = "your_username";
    private static final String PASSWORD = "your_password";

    public static void exportToExcel() {
        String sql = "SELECT * FROM scraped_data";

        try (Connection conn = DriverManager.getConnection(URL, USER, PASSWORD);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql);
             Workbook workbook = new XSSFWorkbook()) {

            Sheet sheet = workbook.createSheet("Scraped Data");

            // Create the header row
            Row headerRow = sheet.createRow(0);
            String[] columns = {"ID", "Title", "Price", "URL"};
            for (int i = 0; i < columns.length; i++) {
                headerRow.createCell(i).setCellValue(columns[i]);
            }

            // Fill one row per database record
            int rowNum = 1;
            while (rs.next()) {
                Row row = sheet.createRow(rowNum++);
                row.createCell(0).setCellValue(rs.getInt("id"));
                row.createCell(1).setCellValue(rs.getString("title"));
                row.createCell(2).setCellValue(rs.getDouble("price"));
                row.createCell(3).setCellValue(rs.getString("url"));
            }

            // Write the workbook to disk
            try (FileOutputStream out = new FileOutputStream("scraped_data.xlsx")) {
                workbook.write(out);
            }
        } catch (SQLException | IOException e) {
            e.printStackTrace();
        }
    }
}

This reads every row from the scraped_data table and writes it to scraped_data.xlsx, which you can open directly in Excel for further analysis or reporting.