Web Scraping with C++ and MariaDB: A Beginner's Tutorial

Web scraping extracts data from websites and is widely used in fields such as data analysis, market research, and competitive intelligence. In this tutorial, we will fetch a web page with C++ and store the result in a MariaDB database. The guide is written for beginners and proceeds step by step.

Understanding Web Scraping

Web scraping involves fetching data from web pages and processing it to extract useful information. It is important to note that web scraping should be done ethically and in compliance with the website’s terms of service. Many websites provide APIs for data access, which should be preferred over scraping when available.

We use C++ in this tutorial because it is fast and gives fine-grained control over system resources, which suits tasks that require high performance.

Setting Up the Environment

Before we begin, we need to set up our development environment. This involves installing the necessary tools and libraries for C++ and MariaDB. We will use the following tools:

GCC: A compiler for C++.
libcurl: A library for transferring data with URLs, which we will use for HTTP requests.
MariaDB: A popular open-source database management system.

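To install these tools, you can use a package manager such as apt on Debian-based Linux distributions or Homebrew on macOS. On Windows, you can use MinGW for GCC and download the MariaDB installer from the official website. Note that the C++ examples need the development headers for libcurl and for the MariaDB client library (MariaDB Connector/C), which are often packaged separately from the library and the database server.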

Writing the Web Scraping Code in C++

Now that our environment is set up, let’s write the C++ code to perform web scraping. We will use libcurl to send HTTP requests and retrieve web page content. Below is a simple example of how to fetch a web page using C++ and libcurl:

#include <curl/curl.h>
#include <iostream>
#include <string>

// libcurl calls this once per chunk of the response body; each chunk is appended to the string.
size_t WriteCallback(char* contents, size_t size, size_t nmemb, void* userdata) {
    size_t newLength = size * nmemb;
    std::string* s = static_cast<std::string*>(userdata);
    try {
        s->append(contents, newLength);
    } catch (const std::bad_alloc&) {
        // Returning a different byte count than received makes libcurl abort the transfer.
        return 0;
    }
    return newLength;
}

int main() {
    std::string readBuffer;

    CURL* curl = curl_easy_init();
    if (!curl) {
        std::cerr << "curl_easy_init() failed\n";
        return 1;
    }

    curl_easy_setopt(curl, CURLOPT_URL, "http://example.com");
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &readBuffer);

    CURLcode res = curl_easy_perform(curl);
    curl_easy_cleanup(curl);

    if (res != CURLE_OK) {
        std::cerr << "Error: " << curl_easy_strerror(res) << std::endl;
        return 1;
    }

    std::cout << "Data fetched successfully:\n" << readBuffer << std::endl;
    return 0;
}

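This code initializes a CURL session, sets the URL to fetch, and registers a callback function that libcurl calls as data arrives. The fetched data is accumulated in a string and printed to the console. To build the program with GCC, link against libcurl, for example: g++ scraper.cpp -o scraper -lcurl (the file name scraper.cpp is only a placeholder).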

Storing Data in MariaDB

Once we have extracted the data, the next step is to store it in a MariaDB database. MariaDB is a robust and scalable database system that is compatible with MySQL. We will use the MariaDB C API to interact with the database.

First, we need to create a database and a table to store the scraped data. Here is a simple SQL script to create a database and a table:

CREATE DATABASE WebScrapingDB;
USE WebScrapingDB;

CREATE TABLE ScrapedData (
    id INT AUTO_INCREMENT PRIMARY KEY,
    content TEXT NOT NULL
);

Next, we will write C++ code to insert the scraped data into the MariaDB database. Below is an example of how to connect to MariaDB and insert data:

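#include <mysql.h>   // MariaDB Connector/C header; its include path is reported by mariadb_config --cflags
#include <iostream>
#include <string>

void insertData(const std::string& data) {
    MYSQL* conn = mysql_init(NULL);
    if (conn == NULL) {
        std::cerr << "mysql_init() failed\n";
        return;
    }

    if (mysql_real_connect(conn, "localhost", "user", "password", "WebScrapingDB", 0, NULL, 0) == NULL) {
        std::cerr << "mysql_real_connect() failed: " << mysql_error(conn) << "\n";
        mysql_close(conn);
        return;
    }

    // Escape the payload so that quotes in scraped HTML cannot break or alter the statement.
    // The output buffer must hold up to 2 * length + 1 bytes.
    std::string escaped(data.size() * 2 + 1, '\0');
    unsigned long escapedLength =
        mysql_real_escape_string(conn, &escaped[0], data.c_str(), data.size());
    escaped.resize(escapedLength);

    std::string query = "INSERT INTO ScrapedData (content) VALUES ('" + escaped + "')";
    if (mysql_real_query(conn, query.c_str(), query.size())) {
        std::cerr << "INSERT failed. Error: " << mysql_error(conn) << std::endl;
    }

    mysql_close(conn);
}

int main() {
    std::string scrapedData = "Sample scraped data";
    insertData(scrapedData);
    return 0;
}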

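This code connects to the MariaDB database and inserts the scraped data into the ScrapedData table. Ensure that you replace “user” and “password” with your actual database credentials. To build it with GCC, you can take the compiler and linker flags from the mariadb_config tool that ships with MariaDB Connector/C, for example: g++ insert.cpp -o insert $(mariadb_config --cflags --libs) (insert.cpp is only a placeholder file name).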

Conclusion

In this tutorial, we have covered the basics of web scraping using C++ and storing the extracted data in a MariaDB database. We started by setting up the development environment, then wrote C++ code to fetch web page content using libcurl. Finally, we demonstrated how to store the scraped data in a MariaDB database using the MariaDB C API.

Web scraping is a valuable skill that can be applied in various domains. As you gain more experience, you can explore more advanced topics such as handling dynamic content, using headless browsers, and implementing scraping frameworks. Remember to always respect the terms of service of the websites you scrape and consider using APIs when available.