{"id":834,"date":"2024-09-02T13:20:38","date_gmt":"2024-09-02T13:20:38","guid":{"rendered":"https:\/\/dev2.rayobyte.com\/community\/?post_type=scraping_project&#038;p=834"},"modified":"2024-10-09T16:24:10","modified_gmt":"2024-10-09T16:24:10","slug":"building-a-dynamic-web-crawler-in-php-a-practical-tutorial-scraping-carousell-com-my-as-an-example","status":"publish","type":"scraping_project","link":"https:\/\/rayobyte.com\/community\/scraping-project\/building-a-dynamic-web-crawler-in-php-a-practical-tutorial-scraping-carousell-com-my-as-an-example\/","title":{"rendered":"Building a Dynamic Web Crawler in PHP: A Practical Tutorial Scraping Carousell.com.my As An Example"},"content":{"rendered":"<p style=\"text-align: left\">Web scraping is an invaluable skill for anyone looking to extract and analyze data from websites. Whether you&#8217;re gathering data for research, monitoring prices, or simply collecting information, a web crawler can help automate the process. In this tutorial, we&#8217;ll guide you through building a dynamic web crawler using PHP, with Carousell Malaysia (<code>carousell.com.my<\/code>) as our example website.<\/p>\n<p>We&#8217;ll create a PHP script that starts from a given URL, dynamically discovers all relevant pages, and extracts specific data like the product title, featured image URL, description, price, and condition. Additionally, we&#8217;ll ensure that the script can operate through a proxy, which is useful for scraping sites that have access restrictions or rate limits.<\/p>\n<h3>Introduction<\/h3>\n<p>PHP is a versatile and widely-used scripting language that makes it easy to build a basic web crawler. In this tutorial, we&#8217;ll focus on crawling a sample website\u2014Carousell Malaysia. 
We&#8217;ll show you how to set up a dynamic crawler that can navigate through pages, extract specific data, and save the results in a CSV file.<\/p>\n<p>This tutorial is ideal for developers, data analysts, and anyone interested in web scraping. By the end, you&#8217;ll have a fully functional PHP crawler that you can adapt to other websites as needed.<\/p>\n<h3>Prerequisites<\/h3>\n<p>Before you begin, make sure you have:<\/p>\n<ul>\n<li>A working knowledge of PHP.<\/li>\n<li>A PHP environment set up on your local machine.<\/li>\n<li>A text editor or an integrated development environment (IDE).<\/li>\n<li>Access to a web server (like Apache or Nginx) with PHP support.<\/li>\n<li>If needed, proxy details (IP, port, username, and password) for sites that require proxy access.<\/li>\n<\/ul>\n<h3>Step 1: Set Up the PHP Crawler Script<\/h3>\n<p>The first step is to create the PHP script that will perform the web crawling. Below is the PHP code you&#8217;ll use:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">&lt;?php\n\n\/\/ Define the base URL to start crawling\n$base_url = \"https:\/\/www.carousell.com.my\/\";  \/\/ Input the URL you want to start crawling from\n\n\/\/ Define your proxy details (if needed)\n$proxy = \"your_proxy_ip:your_proxy_port\";  \/\/ Input your proxy IP and port\n$proxy_auth = \"username:password\";  \/\/ Input your proxy username and password\n\n\/\/ Initialize an array to store crawled data and visited URLs\n$data = [];\n$visited_urls = [];\n\n\/\/ Function to get the HTML content of a URL using a proxy\nfunction get_html($url, $proxy = null, $proxy_auth = null) {\n    $options = [\n        'http' =&gt; [\n            'method' =&gt; \"GET\",\n            'header' =&gt; \"User-Agent: PHP-Crawler\/1.0\\r\\n\"\n        ]\n    ];\n\n    \/\/ Add proxy settings if provided\n    if ($proxy &amp;&amp; $proxy_auth) {\n        $options['http']['header'] .= \"Proxy-Authorization: Basic \" . base64_encode($proxy_auth) . 
\"\\r\\n\";\n        $options['http']['proxy'] = \"tcp:\/\/$proxy\";\n        $options['http']['request_fulluri'] = true;\n    }\n\n    $context = stream_context_create($options);\n    return file_get_contents($url, false, $context);\n}\n\n\/\/ Function to extract product details from the HTML\nfunction extract_data($html, $url) {\n    $dom = new DOMDocument;\n    \n    \/\/ Suppress errors due to malformed HTML\n    @$dom-&gt;loadHTML($html);\n\n    $data = [];\n\n    \/\/ XPath to locate specific elements\n    $xpath = new DOMXPath($dom);\n\n    \/\/ Extract the title\n    $title_nodes = $xpath-&gt;query('\/\/meta[@property=\"og:title\"]');\n    $title = ($title_nodes-&gt;length &gt; 0) ? $title_nodes-&gt;item(0)-&gt;getAttribute('content') : 'N\/A';\n\n    \/\/ Extract the featured image\n    $image_nodes = $xpath-&gt;query('\/\/meta[@property=\"og:image\"]');\n    $featured_image = ($image_nodes-&gt;length &gt; 0) ? $image_nodes-&gt;item(0)-&gt;getAttribute('content') : 'N\/A';\n\n    \/\/ Extract description\n    $description_nodes = $xpath-&gt;query('\/\/meta[@property=\"og:description\"]');\n    $description = ($description_nodes-&gt;length &gt; 0) ? $description_nodes-&gt;item(0)-&gt;getAttribute('content') : 'N\/A';\n\n    \/\/ Extract price\n    $price_nodes = $xpath-&gt;query('\/\/div[contains(@class, \"MMOxT3+_JqMFMZ4RTs4g\") and contains(@class, \"MMOxT3+_eJf5luTP5NQvc\")]');\n    $price = ($price_nodes-&gt;length &gt; 0) ? trim($price_nodes-&gt;item(0)-&gt;nodeValue) : 'N\/A';\n\n    \/\/ Extract condition\n    $condition_nodes = $xpath-&gt;query('\/\/span[contains(@class, \"_2vJS6pPzGwaXAfZEqgKkZn\") and contains(text(), \"Condition\")]');\n    $condition = ($condition_nodes-&gt;length &gt; 0) ? 
trim($condition_nodes-&gt;item(0)-&gt;nextSibling-&gt;nodeValue ?? 'N\/A') : 'N\/A';\n\n    \/\/ Add the extracted data to the result array\n    $data = [\n        'title' =&gt; $title,\n        'image' =&gt; $featured_image,\n        'url' =&gt; $url,\n        'description' =&gt; $description,\n        'price' =&gt; $price,\n        'condition' =&gt; $condition\n    ];\n\n    return $data;\n}\n\n\/\/ Function to extract and crawl URLs from a page\nfunction extract_and_crawl_urls($html, $base_url, &amp;$data, &amp;$visited_urls, $proxy, $proxy_auth) {\n    $dom = new DOMDocument;\n    \n    \/\/ Suppress errors due to malformed HTML\n    @$dom-&gt;loadHTML($html);\n\n    \/\/ XPath to locate anchor elements\n    $xpath = new DOMXPath($dom);\n    $url_nodes = $xpath-&gt;query('\/\/a[@href]');\n\n    \/\/ Crawl each discovered URL\n    foreach ($url_nodes as $url_node) {\n        $relative_url = $url_node-&gt;getAttribute('href');\n        $absolute_url = filter_var($relative_url, FILTER_VALIDATE_URL) ? $relative_url : rtrim($base_url, '\/') . '\/' . 
ltrim($relative_url, '\/');\n        \n        \/\/ Filter and ensure we only crawl carousell.com.my pages and avoid already visited URLs\n        if (strpos($absolute_url, 'carousell.com.my') !== false &amp;&amp; !isset($visited_urls[$absolute_url])) {\n            $visited_urls[$absolute_url] = true;\n            \n            \/\/ Fetch the page and skip it if the request failed\n            $page_html = get_html($absolute_url, $proxy, $proxy_auth);\n            if ($page_html === false || $page_html === '') {\n                continue;\n            }\n\n            \/\/ Extract data if the URL has the structure of a product listing page (contains \/p\/)\n            if (preg_match('#\/p\/#', $absolute_url)) {\n                $data[] = extract_data($page_html, $absolute_url);\n            } else {\n                \/\/ Recursively crawl other pages to discover more URLs\n                extract_and_crawl_urls($page_html, $base_url, $data, $visited_urls, $proxy, $proxy_auth);\n            }\n        }\n    }\n}\n\n\/\/ Start crawling from the base URL\n$html = get_html($base_url, $proxy, $proxy_auth);\nif ($html === false || $html === '') {\n    exit(\"Could not fetch $base_url\\n\");\n}\nextract_and_crawl_urls($html, $base_url, $data, $visited_urls, $proxy, $proxy_auth);\n\n\/\/ Output the results to a CSV file\n$csv_file = fopen(\"crawled_data.csv\", \"w\");\n\n\/\/ Write the headers to the CSV\nfputcsv($csv_file, ['Title', 'Featured Image URL', 'Page URL', 'Description', 'Price', 'Condition']);\n\n\/\/ Write each row of data\nforeach ($data as $row) {\n    fputcsv($csv_file, $row);\n}\n\nfclose($csv_file);\n\necho \"Crawling complete! Check crawled_data.csv for the results.\\n\";\n\n?&gt;\n<\/pre>\n<h3>Step 2: Input Your Variables<\/h3>\n<p>Before running the script, you&#8217;ll need to input specific details into the code:<\/p>\n<ol>\n<li><strong>Base URL to Crawl<\/strong>:\n<ul>\n<li><strong>Line 4<\/strong>: Set the <code>$base_url<\/code> variable to the URL of the site you want to start crawling from. 
In this tutorial, we start with <code>https:\/\/www.carousell.com.my\/<\/code>.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Proxy Settings<\/strong> (Optional):\n<ul>\n<li><strong>Line 7<\/strong>: If you&#8217;re using a proxy, set the <code>$proxy<\/code> variable to your proxy&#8217;s IP and port (e.g., <code>\"203.0.113.10:8080\"<\/code>).<\/li>\n<li><strong>Line 8<\/strong>: Set the <code>$proxy_auth<\/code> variable to your proxy&#8217;s username and password (e.g., <code>\"user:password\"<\/code>). If no proxy is needed, set both variables to <code>null<\/code> so the script connects directly.<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<h3>Step 3: Running the Crawler<\/h3>\n<p>With your variables set, you can now run the script:<\/p>\n<ol>\n<li><strong>Save the PHP Script<\/strong>: Save the provided PHP code as <code>crawler.php<\/code> in your project directory.<\/li>\n<li><strong>Run the Script<\/strong>: Open your terminal or command prompt, navigate to the directory where <code>crawler.php<\/code> is saved, and run:<\/li>\n<\/ol>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\">php crawler.php<\/pre>\n<ol start=\"3\">\n<li><strong>Check the Output<\/strong>: After the script finishes running, you&#8217;ll find a <code>crawled_data.csv<\/code> file in the same directory. 
This file contains all the data extracted by your crawler.<\/li>\n<\/ol>\n<h3>Step 4: Understanding the Output<\/h3>\n<p>The generated CSV file will have the following columns:<\/p>\n<ul>\n<li><strong>Title<\/strong>: The title of the product listing.<\/li>\n<li><strong>Featured Image URL<\/strong>: The URL of the featured image associated with the product.<\/li>\n<li><strong>Page URL<\/strong>: The URL of the product listing page.<\/li>\n<li><strong>Description<\/strong>: The description provided in the product listing.<\/li>\n<li><strong>Price<\/strong>: The listed price of the product.<\/li>\n<li><strong>Condition<\/strong>: The condition of the product (e.g., &#8220;Brand New&#8221;, &#8220;Used&#8221;).<\/li>\n<\/ul>\n<p>Here\u2019s an example of how the CSV might look:<\/p>\n<table style=\"height: 112px\" width=\"999\">\n<thead>\n<tr>\n<th>Title<\/th>\n<th>Featured Image URL<\/th>\n<th>Page URL<\/th>\n<th>Description<\/th>\n<th>Price<\/th>\n<th>Condition<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Clearance Dell Alienware X17 R2 X15 R2 M17 R5 M15 R7 M16 R1<\/td>\n<td><a href=\"https:\/\/rayobyte.com\/community\/wp-content\/uploads\/2024\/09\/gaming_laptop_design-alienware-x17-r1.jpg\" target=\"_new\" rel=\"noopener\">Image URL<\/a><\/td>\n<td><a href=\"https:\/\/www.carousell.com.my\/p\/clearance-dell-alienware-x17-r2-x15-r2-m17-r5-m15-r7-m16-r1-1322654111\/\" target=\"_new\" rel=\"noopener nofollow\">Product URL<\/a><\/td>\n<td>Brand new Alienware models available at clearance prices!<\/td>\n<td>RM 9,999<\/td>\n<td>Brand New<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>Conclusion<\/h3>\n<p>In this tutorial, we\u2019ve created a dynamic web crawler in PHP that can discover and extract data from pages on <code>carousell.com.my<\/code>. The script is designed to start from a specified URL, dynamically navigate through the site, and gather relevant product information. 
By incorporating proxy support, you can ensure that your crawler operates effectively even in restricted or rate-limited environments.<\/p>\n<p>This tutorial provides a practical example of using PHP for web scraping and can be adapted to other websites or more complex scraping tasks as needed. Whether you&#8217;re gathering data for analysis, research, or other purposes, this PHP crawler offers a solid foundation for your web scraping projects.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Web scraping is an invaluable skill for anyone looking to extract and analyze data from websites. Whether you&#8217;re gathering data for research, monitoring prices, or&hellip;<\/p>\n","protected":false},"author":18,"featured_media":835,"comment_status":"open","ping_status":"closed","template":"","meta":{"rank_math_lock_modified_date":false},"categories":[],"class_list":["post-834","scraping_project","type-scraping_project","status-publish","has-post-thumbnail","hentry"],"_links":{"self":[{"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/scraping_project\/834","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/scraping_project"}],"about":[{"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/types\/scraping_project"}],"author":[{"embeddable":true,"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/users\/18"}],"replies":[{"embeddable":true,"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/comments?post=834"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/media\/835"}],"wp:attachment":[{"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/media?parent=834"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rayobyte.com\/community\/wp-json\/wp\/v2\/categories?post=834"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}