Daily Mail Scraper with JavaScript and PostgreSQL

Web scraping has become an essential tool for data enthusiasts, researchers, and businesses looking to gather information from the web. In this article, we will explore how to create a web scraper for the Daily Mail using JavaScript and PostgreSQL. This combination allows for efficient data extraction and storage, providing a robust solution for handling large datasets.

Understanding Web Scraping

Web scraping involves extracting data from websites and transforming it into a structured format. This process is crucial for various applications, such as market research, sentiment analysis, and competitive analysis. By automating data collection, web scraping saves time and effort compared to manual data gathering.

However, web scraping must be conducted ethically and legally. It’s important to respect the terms of service of websites and ensure that scraping activities do not overload servers or infringe on intellectual property rights.

Why Use JavaScript for Web Scraping?

JavaScript is a versatile language that can be used both on the client-side and server-side. For web scraping, JavaScript offers several advantages:

  • Asynchronous Operations: JavaScript’s asynchronous nature allows for non-blocking operations, making it ideal for handling multiple requests simultaneously.
  • Rich Ecosystem: With libraries like Puppeteer and Cheerio, JavaScript provides powerful tools for navigating and manipulating web pages.
  • Cross-Platform Compatibility: JavaScript can run on various platforms, making it accessible for developers with different operating systems.

These features make JavaScript a popular choice for web scraping tasks, especially when dealing with dynamic websites like the Daily Mail.
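To illustrate the first point, a helper that issues several page fetches concurrently can be sketched as follows. Here `fetchPage` is a stand-in for whatever async request function you use; it is not part of any particular library:

```javascript
// Run one async fetch per URL concurrently and collect the results in order.
// fetchPage is any async function (url) => pageBody; passing it in keeps the
// sketch independent of a specific HTTP client.
async function scrapeConcurrently(urls, fetchPage) {
  return Promise.all(urls.map(url => fetchPage(url)));
}
```

Because the requests are issued together rather than one after another, the total time is roughly that of the slowest single request instead of the sum of all of them.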

Setting Up the Environment

Before diving into the code, it’s essential to set up the development environment. You’ll need Node.js installed on your machine, as it provides the runtime for executing JavaScript code outside the browser. Additionally, you’ll need PostgreSQL, a powerful open-source relational database, to store the scraped data.

To install Node.js, visit the official website and download the installer for your operating system. For PostgreSQL, you can use package managers like Homebrew for macOS or apt for Ubuntu. Once installed, ensure that both Node.js and PostgreSQL are correctly configured by running the following commands in your terminal:

node -v
psql --version

Building the Daily Mail Scraper

Now that the environment is set up, let’s start building the Daily Mail scraper. We’ll use Puppeteer, a Node.js library that provides a high-level API for controlling headless Chrome or Chromium browsers. This allows us to interact with web pages as if we were using a real browser.

First, create a new directory for your project and initialize a Node.js project:

mkdir daily-mail-scraper
cd daily-mail-scraper
npm init -y

Next, install Puppeteer:

npm install puppeteer

With Puppeteer installed, we can start writing the scraper script. The following code demonstrates how to navigate to the Daily Mail website and extract article titles:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.dailymail.co.uk');

  const articleTitles = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('.article .linkro-darkred')).map(article => article.textContent.trim());
  });

  console.log(articleTitles);

  await browser.close();
})();

This script launches a headless browser, navigates to the Daily Mail homepage, and extracts the titles of articles using CSS selectors. The extracted data is then logged to the console.
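The function passed to page.evaluate runs inside the browser, but the transform itself is ordinary JavaScript and can be expressed, and unit-tested, outside Puppeteer. In this sketch, `elements` stands in for the list returned by querySelectorAll:

```javascript
// Map each matched element to its trimmed text content, mirroring the
// logic inside page.evaluate. Works on any array-like of objects that
// expose a textContent string.
function extractTitles(elements) {
  return Array.from(elements).map(el => el.textContent.trim());
}
```

Keeping the transform in a plain function like this makes it easy to verify the trimming and mapping logic without launching a browser.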

Storing Data in PostgreSQL

Once we have the scraped data, the next step is to store it in a PostgreSQL database. This allows for efficient querying and analysis of the data. First, ensure that PostgreSQL is running and create a new database:

createdb daily_mail

Next, connect to the database and create a table to store the article titles:

psql -d daily_mail
CREATE TABLE articles (
  id SERIAL PRIMARY KEY,
  title TEXT NOT NULL
);
\q

With the database set up, we can modify our scraper script to insert the extracted titles into the database. We’ll use the `pg` library to interact with PostgreSQL from Node.js:

npm install pg

Update the scraper script to include database insertion:

const { Client } = require('pg');
const puppeteer = require('puppeteer');

(async () => {
  const client = new Client({
    user: 'your_username',
    host: 'localhost',
    database: 'daily_mail',
    password: 'your_password',
    port: 5432,
  });

  await client.connect();

  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.dailymail.co.uk');

  const articleTitles = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('.article .linkro-darkred')).map(article => article.textContent.trim());
  });

  for (const title of articleTitles) {
    await client.query('INSERT INTO articles (title) VALUES ($1)', [title]);
  }

  await browser.close();
  await client.end();
})();

This updated script connects to the PostgreSQL database and inserts each article title into the `articles` table. Ensure you replace `your_username` and `your_password` with your actual PostgreSQL credentials.
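Inserting titles one at a time costs one database round trip per row. As a sketch (this helper is not part of the pg API itself), the rows can be combined into a single parameterized INSERT whose placeholders are generated from the list length:

```javascript
// Build one multi-row INSERT with numbered placeholders ($1, $2, ...) so all
// titles are sent in a single query. Assumes titles is non-empty. The returned
// object matches the (text, values) shape accepted by pg's client.query.
function buildBatchInsert(titles) {
  const placeholders = titles.map((_, i) => `($${i + 1})`).join(', ');
  return {
    text: `INSERT INTO articles (title) VALUES ${placeholders}`,
    values: titles,
  };
}
```

It would be used in place of the loop: `const { text, values } = buildBatchInsert(articleTitles); await client.query(text, values);`.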

Conclusion

In this article, we explored how to build a web scraper for the Daily Mail using JavaScript and PostgreSQL. By leveraging Puppeteer, we were able to extract article titles from the website efficiently. We then stored the data in a PostgreSQL database, allowing for easy access and analysis.

Web scraping is a powerful tool, but it’s important to use it responsibly and ethically. Always respect website terms of service and ensure that your scraping activities do not negatively impact the website’s performance.

With the knowledge gained from this article, you can now apply similar techniques to scrape data from other websites and store the results in a structured format for later analysis.
