Building a Dynamic Web Crawler in PHP: A Practical Tutorial Using Carousell.com.my as an Example

Web scraping is an invaluable skill for anyone looking to extract and analyze data from websites. Whether you’re gathering data for research, monitoring prices, or simply collecting information, a web crawler can help automate the process. In this tutorial, we’ll guide you through building a dynamic web crawler using PHP, with Carousell Malaysia (carousell.com.my) as our example website.

We’ll create a PHP script that starts from a given URL, dynamically discovers all relevant pages, and extracts specific data like the product title, featured image URL, description, price, and condition. Additionally, we’ll ensure that the script can operate through a proxy, which is useful for scraping sites that have access restrictions or rate limits.

Introduction

PHP is a versatile and widely-used scripting language that makes it easy to build a basic web crawler. In this tutorial, we’ll focus on crawling a sample website—Carousell Malaysia. We’ll show you how to set up a dynamic crawler that can navigate through pages, extract specific data, and save the results in a CSV file.

This tutorial is ideal for developers, data analysts, and anyone interested in web scraping. By the end, you’ll have a fully functional PHP crawler that you can adapt to other websites as needed.

Prerequisites

Before you begin, make sure you have:

  • A working knowledge of PHP.
  • A PHP environment set up on your local machine.
  • A text editor or an integrated development environment (IDE).
  • Access to a web server (like Apache or Nginx) with PHP support.
  • If needed, proxy details (IP, port, username, and password) for sites that require proxy access.

Step 1: Set Up the PHP Crawler Script

The first step is to create the PHP script that will perform the web crawling. Below is the PHP code you’ll use:

<?php

// Define the base URL to start crawling
$base_url = "https://www.carousell.com.my/";  // Input the URL you want to start crawling from

// Define your proxy details (set both to null if you don't need a proxy)
$proxy = "your_proxy_ip:your_proxy_port";  // Input your proxy IP and port
$proxy_auth = "username:password";  // Input your proxy username and password

// Initialize an array to store crawled data and visited URLs
$data = [];
$visited_urls = [];

// Function to get the HTML content of a URL using a proxy
function get_html($url, $proxy = null, $proxy_auth = null) {
    $options = [
        'http' => [
            'method' => "GET",
            'header' => "User-Agent: PHP-Crawler/1.0rn"
        ]
    ];

    // Add proxy settings if provided
    if ($proxy && $proxy_auth) {
        $options['http']['header'] .= "Proxy-Authorization: Basic " . base64_encode($proxy_auth) . "\r\n";
        $options['http']['proxy'] = "tcp://$proxy";
        $options['http']['request_fulluri'] = true;
    }

    $context = stream_context_create($options);
    $html = @file_get_contents($url, false, $context);

    // Return an empty string on failure so callers can handle it gracefully
    return $html !== false ? $html : '';
}

// Function to extract product details from the HTML
function extract_data($html, $url) {
    $dom = new DOMDocument;
    
    // Suppress errors due to malformed HTML
    @$dom->loadHTML($html);

    $data = [];

    // XPath to locate specific elements
    $xpath = new DOMXPath($dom);

    // Extract the title
    $title_nodes = $xpath->query('//meta[@property="og:title"]');
    $title = ($title_nodes->length > 0) ? $title_nodes->item(0)->getAttribute('content') : 'N/A';

    // Extract the featured image
    $image_nodes = $xpath->query('//meta[@property="og:image"]');
    $featured_image = ($image_nodes->length > 0) ? $image_nodes->item(0)->getAttribute('content') : 'N/A';

    // Extract description
    $description_nodes = $xpath->query('//meta[@property="og:description"]');
    $description = ($description_nodes->length > 0) ? $description_nodes->item(0)->getAttribute('content') : 'N/A';

    // Extract price
    $price_nodes = $xpath->query('//div[contains(@class, "MMOxT3+_JqMFMZ4RTs4g") and contains(@class, "MMOxT3+_eJf5luTP5NQvc")]');
    $price = ($price_nodes->length > 0) ? trim($price_nodes->item(0)->nodeValue) : 'N/A';

    // Extract condition
    $condition_nodes = $xpath->query('//span[contains(@class, "_2vJS6pPzGwaXAfZEqgKkZn") and contains(text(), "Condition")]');
    $condition = ($condition_nodes->length > 0 && $condition_nodes->item(0)->nextSibling) ? trim($condition_nodes->item(0)->nextSibling->nodeValue) : 'N/A';

    // Add the extracted data to the result array
    $data = [
        'title' => $title,
        'image' => $featured_image,
        'url' => $url,
        'description' => $description,
        'price' => $price,
        'condition' => $condition
    ];

    return $data;
}

// Function to extract and crawl URLs from a page
function extract_and_crawl_urls($html, $base_url, &$data, &$visited_urls, $proxy, $proxy_auth) {
    $dom = new DOMDocument;
    
    // Suppress errors due to malformed HTML
    @$dom->loadHTML($html);

    // XPath to locate anchor elements
    $xpath = new DOMXPath($dom);
    $url_nodes = $xpath->query('//a[@href]');

    // Crawl each discovered URL
    foreach ($url_nodes as $url_node) {
        $relative_url = $url_node->getAttribute('href');
        $absolute_url = filter_var($relative_url, FILTER_VALIDATE_URL) ? $relative_url : rtrim($base_url, '/') . '/' . ltrim($relative_url, '/');
        
        // Filter and ensure we only crawl carousell.com.my pages and avoid already visited URLs
        if (strpos($absolute_url, 'carousell.com.my') !== false && !isset($visited_urls[$absolute_url])) {
            $visited_urls[$absolute_url] = true;
            
            // Crawl the page if its URL looks like a product listing page (it contains the /p/ segment)
            if (preg_match('/\/p\//', $absolute_url)) {
                $page_html = get_html($absolute_url, $proxy, $proxy_auth);
                $page_data = extract_data($page_html, $absolute_url);
                $data[] = $page_data;
            } else {
                // Recursively crawl other pages to discover more URLs
                $page_html = get_html($absolute_url, $proxy, $proxy_auth);
                extract_and_crawl_urls($page_html, $base_url, $data, $visited_urls, $proxy, $proxy_auth);
            }
        }
    }
}

// Start crawling from the base URL
$html = get_html($base_url, $proxy, $proxy_auth);
extract_and_crawl_urls($html, $base_url, $data, $visited_urls, $proxy, $proxy_auth);

// Output the results to a CSV file
$csv_file = fopen("crawled_data.csv", "w");

// Write the headers to the CSV
fputcsv($csv_file, ['Title', 'Featured Image URL', 'Page URL', 'Description', 'Price', 'Condition']);

// Write each row of data
foreach ($data as $row) {
    fputcsv($csv_file, $row);
}

fclose($csv_file);

echo "Crawling complete! Check crawled_data.csv for the results.n";

?>
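
Note: the script above fetches pages with file_get_contents() and a stream context. If allow_url_fopen is disabled in your PHP configuration, or you want finer control over timeouts and redirects, a cURL-based version of get_html() is a common alternative. The following is a minimal sketch (not part of the tutorial script) that accepts the same $proxy and $proxy_auth values:

<?php
// Alternative get_html() using cURL instead of file_get_contents().
// A minimal sketch; adjust timeouts and error handling to your needs.
function get_html_curl($url, $proxy = null, $proxy_auth = null) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
    curl_setopt($ch, CURLOPT_USERAGENT, "PHP-Crawler/1.0");
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);

    // Route the request through the proxy if one is configured
    if ($proxy) {
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
        if ($proxy_auth) {
            curl_setopt($ch, CURLOPT_PROXYUSERPWD, $proxy_auth);
        }
    }

    $html = curl_exec($ch);
    curl_close($ch);

    // Return an empty string on failure, matching get_html()'s behaviour
    return $html !== false ? $html : '';
}
?>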

Step 2: Input Your Variables

Before running the script, you’ll need to input specific details into the code:

  1. Base URL to Crawl:
    • Set the $base_url variable to the URL of the site you want to start crawling from. In this tutorial, we start with https://www.carousell.com.my/.
  2. Proxy Settings (Optional):
    • If you’re using a proxy, set the $proxy variable to your proxy’s IP and port (e.g., "192.0.2.10:8080").
    • Set the $proxy_auth variable to your proxy’s username and password (e.g., "user:password"). If no proxy is needed, set both variables to null, as shown in the example after this list.
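
For reference, here is how the top of crawler.php might look once the variables are filled in. The proxy IP, port, and credentials below are placeholders only:

// With a proxy (placeholder values)
$base_url = "https://www.carousell.com.my/";
$proxy = "192.0.2.10:8080";
$proxy_auth = "user:password";

// Without a proxy
$base_url = "https://www.carousell.com.my/";
$proxy = null;
$proxy_auth = null;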

Step 3: Running the Crawler

With your variables set, you can now run the script:

  1. Save the PHP Script: Save the provided PHP code as crawler.php in your project directory.
  2. Run the Script: Open your terminal or command prompt, navigate to the directory where crawler.php is saved, and run:
php crawler.php
  3. Check the Output: After the script finishes running, you’ll find a crawled_data.csv file in the same directory. This file contains all the data extracted by your crawler.

Step 4: Understanding the Output

The generated CSV file will have the following columns:

  • Title: The title of the product listing.
  • Featured Image URL: The URL of the featured image associated with the product.
  • Page URL: The URL of the product listing page.
  • Description: The description provided in the product listing.
  • Price: The listed price of the product.
  • Condition: The condition of the product (e.g., “Brand New”, “Used”).

Here’s an example of how the CSV might look:

Title,Featured Image URL,Page URL,Description,Price,Condition
"Clearance Dell Alienware X17 R2 X15 R2 M17 R5 M15 R7 M16 R1","Image URL","Product URL","Brand new Alienware models available at clearance prices!","RM 9,999","Brand New"
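
If you prefer to inspect the results programmatically instead of opening the file in a spreadsheet, the short sketch below reads crawled_data.csv back with fgetcsv() and prints each listing’s title and price (it assumes the file was written by the crawler above, with the same column order):

<?php
// Read the crawler's output back in and print a quick summary.
$handle = fopen("crawled_data.csv", "r");
$headers = fgetcsv($handle);  // first row: Title, Featured Image URL, Page URL, Description, Price, Condition

while (($row = fgetcsv($handle)) !== false) {
    // Skip blank or malformed lines
    if (count($row) !== count($headers)) {
        continue;
    }
    $listing = array_combine($headers, $row);
    echo $listing['Title'] . " - " . $listing['Price'] . "\n";
}

fclose($handle);
?>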

Conclusion

In this tutorial, we’ve created a dynamic web crawler in PHP that can discover and extract data from pages on carousell.com.my. The script is designed to start from a specified URL, dynamically navigate through the site, and gather relevant product information. By incorporating proxy support, you can ensure that your crawler operates effectively even in restricted or rate-limited environments.

This tutorial provides a practical example of using PHP for web scraping and can be adapted to other websites or more complex scraping tasks as needed. Whether you’re gathering data for analysis, research, or other purposes, this PHP crawler offers a solid foundation for your web scraping projects.
