How to scrape news headlines from a news aggregator website?

Kire Lea · 2024-12-18T06:44:03+00:00



Posted by Kire Lea on 12/18/2024 at 6:44 am
Scraping news headlines is a useful way to gather information for research or personal analysis. News aggregator websites often have structured layouts where headlines are stored in consistent HTML elements, making them relatively easy to scrape. Start by inspecting the page in your browser’s developer tools to locate where headlines are stored, usually within tags like h2, div, span, or a with specific classes. For static websites, Python’s requests and BeautifulSoup libraries can be used to fetch and parse the HTML. However, for dynamic websites where headlines are loaded via JavaScript, tools like Selenium or Puppeteer are more suitable.
Here’s an example using BeautifulSoup to scrape headlines:
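```
import requests
from bs4 import BeautifulSoup

url = "https://example.com/news"
# Some sites reject requests that lack a browser-like User-Agent.
headers = {"User-Agent": "Mozilla/5.0"}

# A timeout keeps the script from hanging on an unresponsive server.
response = requests.get(url, headers=headers, timeout=10)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, "html.parser")
    # Adjust the tag and class to match the target site's markup.
    headlines = soup.find_all("h2", class_="headline")
    for idx, headline in enumerate(headlines, 1):
        print(f"{idx}. {headline.get_text(strip=True)}")
else:
    print(f"Failed to fetch the news page (HTTP {response.status_code}).")
```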
When dealing with JavaScript-rendered content, Selenium can drive a real browser, simulate user interactions, and wait for the headlines to load before extraction. Additionally, if the site fetches its data from an API, calling that API directly can save time and improve reliability. How do you handle scraping when the website structure changes frequently?
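For the JavaScript-rendered case, a minimal Selenium sketch could look like the following. It assumes Selenium 4 and a recent Chrome are installed, and it reuses the h2.headline selector from the example above, which is only a placeholder for whatever the target site actually uses. The function names are made up for illustration.

```python
def clean_headlines(raw_texts):
    """Strip whitespace and drop empty strings from scraped element text."""
    return [text.strip() for text in raw_texts if text and text.strip()]


def scrape_rendered_headlines(url, selector="h2.headline", wait_seconds=10):
    """Load url in headless Chrome and return the rendered headline text."""
    # Imported here so clean_headlines() stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until at least one matching element exists in the DOM;
        # raises TimeoutException if none appears within wait_seconds.
        elements = WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, selector))
        )
        return clean_headlines(element.text for element in elements)
    finally:
        driver.quit()  # always release the browser process
```

Calling scrape_rendered_headlines("https://example.com/news") would then return a list of strings that can be printed the same way as in the BeautifulSoup example. The explicit wait is the important part: it replaces fixed sleeps with a condition on the elements you actually need.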
5 Replies

Rhea Erika

Member
12/20/2024 at 1:08 pm

One way I deal with frequent website changes is by building a flexible scraper that uses CSS selectors instead of hardcoding tags or classes. This approach ensures the scraper is easier to update when the website layout changes.
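One possible reading of this suggestion, sketched under assumptions: keep the CSS selectors in a single list and try them in order, so a layout change means editing one list rather than the parsing code. The selectors and the function name below are hypothetical and would need to match the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical selectors, ordered from most to least specific. When the site
# changes its layout, only this list needs to be edited.
HEADLINE_SELECTORS = [
    "h2.headline",
    "article h2 a",
    "a.story-title",
]


def extract_with_fallback(html, selectors=HEADLINE_SELECTORS):
    """Return headlines from the first selector that matches anything."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in selectors:
        matches = soup.select(selector)
        if matches:
            return [m.get_text(strip=True) for m in matches]
    return []  # nothing matched; the layout has probably changed
```

An empty result is a useful signal in itself: logging it makes it obvious when the selector list needs a new entry.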
Martyn Ramadan

Member
01/03/2025 at 7:18 am

For JavaScript-heavy sites, I prefer using Puppeteer over Selenium. It’s faster and more stable, especially for websites with a lot of dynamic elements like news aggregators.
Sultan Miela

Member
01/20/2025 at 1:49 pm

When scraping headlines, I always add error handling for cases where the expected tags are missing or the page fails to load. This ensures the script doesn’t crash unexpectedly.
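A rough sketch of that kind of error handling, assuming the same requests and BeautifulSoup setup and the placeholder h2.headline selector from the original post. The function names are illustrative only.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Return the page HTML, or None if the request fails for any reason."""
    try:
        response = requests.get(
            url, headers={"User-Agent": "Mozilla/5.0"}, timeout=timeout
        )
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.RequestException as exc:
        # Covers malformed URLs, connection errors, timeouts, and HTTP errors.
        print(f"Could not load {url}: {exc}")
        return None
    return response.text


def extract_headlines(html):
    """Return headline strings; an empty list if the expected tags are missing."""
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("h2", class_="headline")
    if not tags:
        print("No h2.headline elements found; the page layout may have changed.")
    # Skip tags that contain only whitespace.
    return [t.get_text(strip=True) for t in tags if t.get_text(strip=True)]
```

With this split, the calling code only has to check for None from fetch_html and for an empty list from extract_headlines, instead of wrapping everything in one large try block.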
annitaz

Member
01/20/2025 at 8:57 pm

Which proxy is the best for outlier?
- Michael Woo
  
  Administrator
  02/21/2025 at 3:12 pm
  
  Normally, mobile or residential proxies work best, since they route traffic through real mobile or residential IP addresses. How is your scraping project going?
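For anyone wiring a proxy into the requests-based scraper from the original post, a minimal sketch follows. The proxy URL is a made-up placeholder, not a real endpoint; the actual host, port, and credentials come from whichever provider you use.

```python
import requests

# Hypothetical proxy endpoint: replace with the host, port, and credentials
# supplied by your proxy provider.
PROXY_URL = "http://user:password@proxy.example.com:8000"


def build_session(proxy_url):
    """Return a Session that sends HTTP and HTTPS traffic through proxy_url."""
    session = requests.Session()
    session.proxies.update({"http": proxy_url, "https": proxy_url})
    session.headers.update({"User-Agent": "Mozilla/5.0"})
    return session
```

A session built this way is used like plain requests, for example build_session(PROXY_URL).get("https://example.com/news", timeout=10), and every request made through it goes out via the proxy.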
