The Ultimate Guide to Building a PHP Web Scraper
If you’re interested in building your own web scraper but you don’t know where to start, you’ve come to the right place. In this article, you’ll learn about building a web scraper with the most popular server-side programming language for websites: PHP.
Read on to learn about PHP web scrapers and how to write your own web scraper in PHP. By the end of this article, you’ll also know which proxies work best for web scraping and the best practices for web scraping with PHP.
What is a PHP Web Scraper?
A PHP web scraper, sometimes called a PHP HTML parser, is a web scraper written in PHP. PHP is one of the oldest and most popular web programming languages, and programmers often choose it for its many advantages, including:
- PHP is easy to run. Unlike many other languages, you only need a machine with PHP installed to run PHP code.
- It uses simple syntax, making it a great choice for beginners.
- PHP has been around since 1995, which means there are many resources, frameworks, and tutorials for PHP.
- It’s great for extracting information from simple pages (i.e., pages that don’t use dynamic content).
- It makes storing or saving the scraped data easier.
However, there are also some disadvantages to using PHP to build your web scraper, including:
- Scraping dynamic content is harder in PHP than in JavaScript or Python.
- PHP code can be fragile. It can break if the web developer changes the target site’s HTML, leading to data loss.
- It’s almost impossible to get real-time data with PHP. As such, if you want to build a web scraper to get real-time data, you’ll need to use another programming language.
How to Make a Web Scraper in PHP
Before you start building a PHP web scraper, make sure you have:
- Basic knowledge of HTML
- Basic knowledge of PHP (and object-oriented programming concepts)
- Simple HTML DOM Parser, a library for PHP 5.6+ that lets you access the target web page’s content easily through selectors
- PHP 5.6 or later installed on your computer
Now, let’s jump into the PHP web scraping tutorial.
Install the latest version of the Simple HTML DOM parser
First, you’ll need to make sure you have the latest version of the Simple HTML DOM parser installed. Click here to download the HTML DOM parser.
Once it’s been downloaded, extract or unzip the file and create a new directory. Then, copy and paste the simple_html_dom.php file into the directory. Create a new file called scraper.php and save it inside this directory.
Get access to all the functions in the library
The next step is to get access to all of the PHP library’s functions. Open the scraper.php file in a text editor. Then, include a reference to the simple HTML DOM parser library at the top of your script.
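As a minimal sketch of this step, the top of scraper.php might look like the following. It assumes simple_html_dom.php sits in the same directory as scraper.php, as set up earlier. The loadParser() helper is a hypothetical name used only for illustration and is not part of the library.

```php
<?php
// scraper.php

/**
 * Load the Simple HTML DOM library from the given directory.
 * loadParser() is a hypothetical helper, not part of the library.
 * It returns false instead of failing hard when the file is missing.
 */
function loadParser(string $directory): bool
{
    $library = $directory . '/simple_html_dom.php';
    if (!is_file($library)) {
        return false;
    }
    require_once $library;
    return true;
}

if (!loadParser(__DIR__)) {
    echo "simple_html_dom.php was not found next to scraper.php\n";
}
```

If you prefer the shortest form, a single include('simple_html_dom.php'); line at the top of the script does the same job, as long as the file can be found.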
Decide what you want to scrape
Before we start building the scraper, you need to decide what URL you want to scrape. In this PHP web scraper tutorial, we’ll be scraping user reviews for the TV series “Loki” from IMDB.com. The URL of the target web page is https://www.imdb.com/title/tt9140554/reviews?ref_=tt_urv.
Then, determine what you want to extract (i.e., the titles of the reviews, the star ratings, etc.). We’ll be scraping the titles of reviews and the star ratings for this PHP web scraping tutorial.
Building the scraper
Create a DOM object to store the above URL’s content. Create a variable named $html and assign it the DOM object returned by the file_get_html() function:
$html = file_get_html('https://www.imdb.com/title/tt9140554/reviews?ref_=tt_urv', false);
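If you want more control over the request, one option is to fetch the page yourself and check for failure before parsing. The sketch below uses only PHP’s built-in file_get_contents() with a stream context. The fetchPage() name, the timeout, and the user agent string are assumptions made for illustration. The returned string could then be handed to the library’s str_get_html() function.

```php
<?php
/**
 * Fetch a page body as a string, or return null on failure.
 * fetchPage() is a hypothetical helper for this tutorial.
 */
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    $context = stream_context_create([
        'http' => [
            'method'     => 'GET',
            'timeout'    => $timeoutSeconds,
            // Illustrative value; identify your scraper honestly.
            'user_agent' => 'MyPhpScraper/1.0',
        ],
    ]);

    // @ suppresses the warning so the caller can handle failure via null.
    $body = @file_get_contents($url, false, $context);

    return $body === false ? null : $body;
}

// Example usage (needs network access, so it is left as a comment):
// $body = fetchPage('https://www.imdb.com/title/tt9140554/reviews?ref_=tt_urv');
// if ($body === null) { exit('Could not fetch the page.'); }
```

Returning null on failure keeps the error handling in one place, so the rest of the scraper never works with an empty page by accident.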
Right-click on a review title and select “inspect” to look at the page’s HTML elements and CSS selectors. Now, look at the Elements tab to identify the HTML elements and CSS selectors that refer to what you want to scrape — which are the titles of the reviews and the star ratings for “Loki” on IMDB.
As you can see in the screenshot, the CSS class selector “review-container” is used for all <div> tags that contain the titles of the reviews and the star ratings.
Continue looking through Elements to spot other relevant class selectors. You’ll find that:
- “title” refers to the title of the review
- “ipl-ratings-bar” refers to the star rating
Then, use the following code to extract the data you need with the help of each class selector.
$results = array();
if (!empty($html)) {
    $i = 0;
    // Each .review-container <div> holds one review
    foreach ($html->find('.review-container') as $div_class) {
        // Extract the review title
        foreach ($div_class->find('.title') as $title) {
            $results[$i]['title'] = $title->plaintext;
        }
        // Extract the star rating
        foreach ($div_class->find('.ipl-ratings-bar') as $ratings) {
            $results[$i]['ratings'] = $ratings->plaintext;
        }
        // Move on to the next review
        $i++;
    }
}
Code by saasindustries on Github via Zenscrape.
Now, all of the extracted data is stored in $results. Printing this array shows the scraped output, which you then need to convert into XML.
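To try the same selector logic without hitting the live site, here is a rough sketch that runs against a small inline HTML sample. Note that it swaps in PHP’s built-in DOMDocument and DOMXPath classes instead of Simple HTML DOM, so it runs with no extra library. The sample markup and the helper names (hasClass, firstText, extractReviews) are assumptions for illustration, and the real page structure may differ.

```php
<?php
// Hypothetical sample markup that reuses the class names from this tutorial.
$sampleHtml = <<<HTML
<div class="review-container">
  <span class="ipl-ratings-bar">8/10</span>
  <a class="title">Great show</a>
</div>
<div class="review-container">
  <a class="title">No rating given</a>
</div>
HTML;

// Build an XPath test that matches one class name inside a class attribute.
function hasClass(string $class): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' " . $class . " ')";
}

// Return the trimmed text of the first descendant with the class, or null.
function firstText(DOMXPath $xpath, DOMNode $context, string $class): ?string
{
    $node = $xpath->query('.//*[' . hasClass($class) . ']', $context)->item(0);
    return $node === null ? null : trim($node->textContent);
}

function extractReviews(string $html): array
{
    // Real pages are rarely valid HTML, so collect parser warnings quietly.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath   = new DOMXPath($doc);
    $results = [];
    foreach ($xpath->query('//div[' . hasClass('review-container') . ']') as $container) {
        $results[] = [
            'title'   => firstText($xpath, $container, 'title'),
            'ratings' => firstText($xpath, $container, 'ipl-ratings-bar'),
        ];
    }
    return $results;
}

print_r(extractReviews($sampleHtml));
```

A review without a rating gets null for 'ratings' here, which makes missing data explicit instead of leaving the key undefined.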
You can do the XML conversion with the built-in SimpleXMLElement class and the following code:
function convertToXML($results, &$xml_user_info) {
    foreach ($results as $key => $value) {
        if (is_array($value)) {
            // XML element names cannot start with a digit, so prefix the numeric index
            $subnode = $xml_user_info->addChild("review$key");
            foreach ($value as $k => $v) {
                // Add each field under its own review node
                $subnode->addChild("$k", htmlspecialchars("$v"));
            }
        } else {
            $xml_user_info->addChild("$key", htmlspecialchars("$value"));
        }
    }
    return $xml_user_info->asXML();
}

$xml_user_info = new SimpleXMLElement('<?xml version="1.0"?><root></root>');
$xml_content = convertToXML($results, $xml_user_info);
Code by saasindustries on Github via Zenscrape.
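To see what the conversion produces, the following self-contained sketch converts a small hand-written array. The reviewsToXml() name, the “review” element name, and the sample data are assumptions for illustration, not part of the original code. It escapes each value before calling addChild(), because addChild() does not escape ampersands on its own.

```php
<?php
/**
 * Convert a list of reviews into an XML string.
 * reviewsToXml() is a hypothetical variant of the conversion step.
 */
function reviewsToXml(array $reviews): string
{
    $root = new SimpleXMLElement('<?xml version="1.0"?><root></root>');

    foreach ($reviews as $review) {
        $node = $root->addChild('review');
        foreach ($review as $field => $value) {
            // Escape first: a raw "&" in the value would trigger a warning.
            $escaped = htmlspecialchars((string) $value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
            $node->addChild($field, $escaped);
        }
    }

    return $root->asXML();
}

// Made-up sample data in the same shape as $results.
$sample = [
    ['title' => 'Fun & clever', 'ratings' => '8/10'],
    ['title' => 'Not for me', 'ratings' => '4/10'],
];

echo reviewsToXml($sample);
```

Using one fixed element name for every review keeps the XML valid, since element names such as 0 or 1 are not allowed.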
Now that the data is stored in the $xml_content variable, you need to create an XML file and write the contents of $xml_content to it:
$xmlFile = 'MovieReview.xml';
$handle = fopen($xmlFile, 'w') or die('Unable to open the file: ' . $xmlFile);
if (fwrite($handle, $xml_content)) {
    echo 'Successfully written to an XML file.';
} else {
    echo 'Error in file generating';
}
// Release the file handle once writing is done
fclose($handle);
Code by saasindustries on Github via Zenscrape.
And you’re done — all the data you wanted to scrape has been stored in an XML file.
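As a quick check that the file was written correctly, you could read it back and count the reviews. This is only a sketch: it uses file_put_contents() as a shorter alternative to fopen() and fwrite(), writes to the system temp directory, and uses a made-up XML string. The saveXml() helper name is hypothetical.

```php
<?php
/**
 * Write XML to disk and report whether the write succeeded.
 * saveXml() is a hypothetical helper for this tutorial.
 */
function saveXml(string $path, string $xmlContent): bool
{
    // file_put_contents() returns the number of bytes written, or false on failure.
    return file_put_contents($path, $xmlContent, LOCK_EX) !== false;
}

// Made-up content standing in for $xml_content.
$path = sys_get_temp_dir() . '/MovieReview.xml';
$xml  = '<?xml version="1.0"?><root><review><title>Sample</title></review></root>';

if (saveXml($path, $xml)) {
    // Load the file again to confirm it is well-formed XML.
    $loaded = simplexml_load_file($path);
    echo 'Reviews saved: ' . count($loaded->review) . "\n";
} else {
    echo "Could not write to $path\n";
}
```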
Best Proxies to Use for Web Scraping with PHP
To make the most out of your new PHP web scraper, you should use high-quality proxies, such as Rayobyte’s rotating residential proxies. Without premier proxies, your scraper may get blocked and banned from your target websites.
Here are some of the best proxies to use for your PHP web scraper.
Rotating Residential Proxies
With Rayobyte’s rotating residential proxies, your IP will be hidden behind a pool of proxies. These proxies will allow you to scrape many sites at once since they switch at regular intervals. This fools target websites’ anti-scraping tools into thinking that you are many human users instead of one scraping bot using multiple IPs.
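To illustrate the idea of switching IPs between requests, here is a rough client-side sketch that cycles through a list of proxies using PHP’s stream context options. The proxy addresses are placeholders, and pickProxy() and proxyContext() are hypothetical helper names. A rotating proxy service may handle the switching for you, in which case you would only configure a single endpoint.

```php
<?php
// Placeholder endpoints; replace them with the addresses from your provider.
$proxies = [
    'tcp://proxy1.example.com:8000',
    'tcp://proxy2.example.com:8000',
];

// Pick a proxy in round-robin order based on the request counter.
function pickProxy(array $proxies, int $requestNumber): string
{
    return $proxies[$requestNumber % count($proxies)];
}

// Build a stream context that routes HTTP requests through the given proxy.
function proxyContext(string $proxy)
{
    return stream_context_create([
        'http' => [
            'proxy'           => $proxy,
            'request_fulluri' => true, // many proxies expect the full URI
            'timeout'         => 10,
        ],
    ]);
}

// Example usage (needs a working proxy, so it is left as a comment):
// $context = proxyContext(pickProxy($proxies, $requestNumber));
// $body    = file_get_contents($url, false, $context);
```

If your scraper already uses cURL, setting CURLOPT_PROXY on the handle is another common way to do the same thing.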
By using our rotating residential proxies to protect your PHP scraper, you’ll experience:
- Fewer IP bans. Say goodbye to IP blocks and bans. Our state-of-the-art proxies allow you to reach your scraping goals as effectively and efficiently as possible.
- A personal relationship with us. At Rayobyte, we see our customers as potential partners. Our CEO may even work with you to help you get the most out of our proxies!
- World-class support. Our engineers are available 24/7 to answer any questions you have about our proxy solutions.
- Unmatched commitment to ethics. Proxies can be a controversial topic for many. That’s why Rayobyte aims to set the highest standards for ethical usage and acquisition of proxies. Learn more about our ethical standards regarding proxies here.
Proxy Pilot
Whether you’re using a Rayobyte proxy solution or not, you should consider getting an all-in-one proxy management tool like Proxy Pilot. Free, powerful, and flexible, Proxy Pilot will help your PHP web scraper gather vital information even more efficiently and effectively.
Proxy Pilot can do all of the following (and more):
- Geo-targeting
- Handling cooldown logic
- Detecting bans
- Handling retries
- Providing advanced statistics about your scraping methods and patterns
Easy to use for scrapers of all levels, Proxy Pilot only takes around 10 minutes to set up. Just input a few lines of code and you’re done. Our team will then decrypt your proxy connections and read the HTML pages for ban detection.
Interested? Sign up using this form. To learn more, you can read Proxy Pilot’s documentation here.
Best Practices for Web Scraping with PHP
Scraping can be a controversial topic, particularly if you’re using proxies. Here’s how you can avoid trouble and be an ethical scraper:
- Respect your target site’s robots.txt file. Play by your target website’s rules. Even if you encounter a rule that seems unfair to you, remember that website owners have the right to limit certain actions.
- Treat the target site gently. With a powerful PHP scraper and rotating residential proxies, you may be tempted to scrape as quickly and frequently as possible. However, that could overload the server. Consider putting random delays between requests and using auto-throttling to limit the crawling speed based on how much the target website can take.
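Both practices can be sketched in a few lines of PHP. The example below is deliberately simplified: it only reads Disallow rules from groups that start with a single “User-agent: *” line, ignores Allow rules and wildcards, and uses plain prefix matching. The function names, the sample robots.txt text, and the delay range are assumptions for illustration.

```php
<?php
// Collect Disallow paths that apply to all user agents (simplified parsing).
function disallowedPaths(string $robotsTxt): array
{
    $paths        = [];
    $appliesToAll = false;

    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // drop comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        list($field, $value) = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            $appliesToAll = ($value === '*');
        } elseif ($field === 'disallow' && $appliesToAll && $value !== '') {
            $paths[] = $value;
        }
    }
    return $paths;
}

// True when no collected Disallow rule is a prefix of the path.
function isPathAllowed(string $path, string $robotsTxt): bool
{
    foreach (disallowedPaths($robotsTxt) as $rule) {
        if (strncmp($path, $rule, strlen($rule)) === 0) {
            return false;
        }
    }
    return true;
}

// Sleep for a random time between the two bounds, given in milliseconds.
function politePause(int $minMs = 1000, int $maxMs = 3000): void
{
    usleep(random_int($minMs, $maxMs) * 1000);
}

// Made-up robots.txt content for demonstration.
$robots = "User-agent: *\nDisallow: /private/\n";
var_dump(isPathAllowed('/private/page', $robots));
var_dump(isPathAllowed('/title/tt9140554/reviews', $robots));
```

In a real scraper, you would call politePause() between requests and skip any URL for which isPathAllowed() returns false. For production use, a dedicated robots.txt parser is a safer choice than this sketch.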
Conclusion
Now that you know what a PHP web scraper is, how to build one, which proxies to use for scraping, and the best practices to follow, you’re ready to start using your PHP web scraper.
To protect your PHP scraper from getting banned, consider getting Rayobyte’s rotating residential proxies today. Our proxies will help you reach your scraping goals as efficiently and effectively as possible, especially if you use them in conjunction with our free Proxy Pilot tool. A simple yet comprehensive proxy management tool, Proxy Pilot allows you to detect bans, handle retries, and more.