
What is a Simple HTML DOM?

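Once the library is installed, you can start parsing HTML and extracting data from it. The examples below assume $html holds a parsed document, as returned by the library's file_get_html() or str_get_html() functions. The first step is to select the elements you need from that document.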
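As a starting point, here is a minimal sketch of loading markup into $html. It assumes the library file is saved as simple_html_dom.php next to your script; that path and the sample markup are illustrative assumptions, not something the library requires.

```php
<?php
// Assumption: simple_html_dom.php sits in the same directory as this script.
require_once __DIR__ . '/simple_html_dom.php';

// Sample markup invented for illustration.
$markup = '<html><head><title>Demo</title></head>'
        . '<body><p class="example-class" id="unique-id">Hello</p></body></html>';

// str_get_html() parses a string; file_get_html() does the same for a file path or URL.
$html = str_get_html($markup);

// str_get_html() returns false when the input is empty or too large to parse.
if ($html === false) {
    exit("Could not parse the HTML.\n");
}

// Print the text of the first <p> element.
echo $html->find('p', 0)->plaintext . "\n";
```

For a live page, file_get_html() can be swapped in with a URL in place of the string.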

HTML DOM Select Element Code

Simple HTML DOM selects elements with CSS-style selectors passed to find(). The three most common ways to select an element are by tag name, by class, and by ID.

1. Selecting by Tag Name

foreach ($html->find('p') as $paragraph) {
    echo $paragraph->plaintext . "<br>";
}

This retrieves all <p> elements and outputs their text.

2. Selecting by Class

$elements = $html->find('.example-class');
foreach ($elements as $element) {
    echo $element->plaintext . "<br>";
}

Prefix the class name with a period (.) to select by class.

3. Selecting by ID

$element = $html->find('#unique-id', 0);
echo $element ? $element->plaintext : "Element not found.";

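Prefix the ID with a hash (#) to select by ID. Passing 0 as the second argument returns the first match instead of an array, or null if nothing matches.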

Extracting Specific Data

Once you have selected an element, you can read the data you want to scrape from it. The plaintext property returns the element's text content, and HTML attributes such as href, src, and alt are available as properties of the same name.

Extracting Titles

$title = $html->find('title', 0);
echo $title ? $title->plaintext : "No title found.";

Extracting Links

foreach ($html->find('a') as $link) {
    echo 'Link text: ' . $link->plaintext . "<br>";
    echo 'URL: ' . $link->href . "<br><br>";
}

Extracting Images

foreach ($html->find('img') as $image) {
    echo 'Image source: ' . $image->src . "<br>";
    echo 'Alt text: ' . $image->alt . "<br><br>";
}

Web Scraping Example: Prices

Knowing how to select elements and read their properties is useful, but let’s put it into a web scraping context. When scraping and parsing the DOM, you’ll usually be looking for one specific piece of data.

How this works will always vary depending on the website you are scraping, but let’s use prices as an example.

Example HTML snippet:

<div class="product">
  <span class="price">$19.99</span>
</div>

Extraction code:

$price = $html->find('.price', 0);
echo $price ? 'Price: ' . $price->plaintext : "Price not found.";

This locates the first element with the class "price" and outputs its text content.

Extracting Multiple Prices

foreach ($html->find('.price') as $price) {
    echo 'Price: ' . trim($price->plaintext) . "<br>";
}

This retrieves and prints all price elements from a page, useful for product listings.
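Scraped prices arrive as text, so they usually need converting before you can sort or compare them. The sketch below uses a hypothetical helper, parsePrice(), that is not part of Simple HTML DOM. It assumes prices use "." as the decimal separator and "," to group thousands, which will not hold for every locale.

```php
<?php
/**
 * Convert scraped price text such as "$1,299.00" into a float.
 * Hypothetical helper: assumes "." is the decimal separator and "," groups thousands.
 */
function parsePrice(string $text): ?float
{
    // Decode entities (e.g. &amp;) and strip surrounding whitespace.
    $clean = trim(html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8'));

    // Grab the first run of digits, with optional thousands separators and decimals.
    if (!preg_match('/\d[\d,]*(?:\.\d+)?/', $clean, $match)) {
        return null; // no number found in the text
    }

    // Remove the thousands separators before casting.
    return (float) str_replace(',', '', $match[0]);
}

var_dump(parsePrice('$19.99'));
var_dump(parsePrice(' $1,299.00 '));
var_dump(parsePrice('Call for price'));
```

Inside the loop above, you could call parsePrice($price->plaintext) and skip any element for which it returns null.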

Handling Variations in Price Markup

Sometimes, prices may be inside nested elements:

<div class="product-price">
  <span class="amount">$24.99</span>
</div>

Extraction code:

$price = $html->find('.product-price .amount', 0);
echo $price ? 'Price: ' . $price->plaintext : "Price not found.";

The descendant selector .product-price .amount matches only .amount elements inside a .product-price element, so unrelated elements that happen to share the same class are not picked up.
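When a site uses more than one price markup, one option is to try several selectors in order and keep the first match. The helper below, firstMatchText(), is a hypothetical name introduced for illustration. It only assumes that $root exposes find($selector, $index) and that matched nodes have a plaintext property, as Simple HTML DOM documents and elements do.

```php
<?php
/**
 * Hypothetical helper: try each selector in order and return the trimmed
 * text of the first match, or null when none of them match.
 *
 * $root is any object exposing find($selector, $index), such as a
 * Simple HTML DOM document or element.
 */
function firstMatchText($root, array $selectors): ?string
{
    foreach ($selectors as $selector) {
        $node = $root->find($selector, 0); // first match, or null
        if ($node) {
            return trim($node->plaintext);
        }
    }

    return null; // nothing matched any selector
}
```

A call such as firstMatchText($html, ['.product-price .amount', '.price']) would cover both markup variants shown in this section, checking the more specific selector first.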

Best Practices for Simple HTML DOM in PHP

  • Respect Website Terms: Always check and adhere to a website's terms of service and robots.txt file before scraping.
  • Use Appropriate Delays: Introduce delays between requests to avoid overwhelming target servers and being blocked.
  • Handle Errors Gracefully: Check for null values and broken elements to prevent script failures.
  • Clean Extracted Data: Use functions like trim() and html_entity_decode() to clean and format scraped content.
  • Limit Requests: Avoid unnecessary requests by targeting only the needed elements.
  • Handle Dynamic Content: Be aware that Simple HTML DOM cannot process JavaScript-rendered content; consider alternative tools if needed.
  • Use User-Agent Headers: Mimic real browser headers when necessary to avoid detection.
  • Cache Responses: Store HTML locally during development to minimize repeated server requests.
  • Monitor Changes: Websites change structure over time, so review and update your scraping logic regularly.
  • Test with Sample Data: Start with static HTML samples before moving to live scraping.
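
Several of these practices (cleaning data, delays, User-Agent headers, caching) can be sketched with PHP's standard library alone. The helpers cleanText() and fetchCached() below are hypothetical names. The cache directory, delay length, and User-Agent string are placeholder assumptions that you would adapt to your own project.

```php
<?php
/** Hypothetical helper: decode entities, collapse whitespace, and trim. */
function cleanText(string $raw): string
{
    $decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of whitespace (including non-breaking spaces) to one space.
    $collapsed = preg_replace('/[\s\x{00A0}]+/u', ' ', $decoded) ?? $decoded;
    return trim($collapsed);
}

/**
 * Hypothetical helper: return the body of $url, reusing a local copy if one exists.
 * $cacheDir, $delaySeconds, and the User-Agent value are illustrative assumptions.
 */
function fetchCached(string $url, string $cacheDir, int $delaySeconds = 2): ?string
{
    $cacheFile = rtrim($cacheDir, '/') . '/' . md5($url) . '.html';

    if (is_file($cacheFile)) {
        return file_get_contents($cacheFile); // cached: no request, no delay
    }

    sleep($delaySeconds); // pause before each real request

    $context = stream_context_create([
        'http' => [
            'header'  => "User-Agent: ExampleScraper/1.0 (contact@example.com)\r\n",
            'timeout' => 10,
        ],
    ]);

    $body = @file_get_contents($url, false, $context);
    if ($body === false) {
        return null; // request failed; the caller decides what to do next
    }

    file_put_contents($cacheFile, $body);
    return $body;
}
```

The string returned by fetchCached() can be passed to str_get_html(), and cleanText() can be applied to any plaintext value before it is stored.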
