How to scrape news headlines from a news aggregator website?
Scraping news headlines is a useful way to gather information for research or personal analysis. News aggregator websites often have structured layouts where headlines are stored in consistent HTML elements, making them relatively easy to scrape. Start by inspecting the page in your browser's developer tools to locate where headlines are stored, usually within tags like div, span, or a (anchor) with specific classes. For static websites, Python’s requests and BeautifulSoup libraries can be used to fetch and parse the HTML. For dynamic websites where headlines are loaded via JavaScript, a browser automation tool like Selenium or Puppeteer is more suitable.
Here’s an example using BeautifulSoup to scrape headlines:

    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com/news"
    headers = {"User-Agent": "Mozilla/5.0"}

    # A timeout keeps the script from hanging on a slow server
    response = requests.get(url, headers=headers, timeout=10)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, "html.parser")
        # Adjust the tag and class to match the target site's markup
        headlines = soup.find_all("h2", class_="headline")
        for idx, headline in enumerate(headlines, 1):
            print(f"{idx}. {headline.text.strip()}")
    else:
        print("Failed to fetch the news page.")
When dealing with JavaScript-rendered content, Selenium can simulate user interactions and load all the headlines before extraction. Additionally, if the site uses an API for fetching data, using that API can save time and improve reliability. How do you handle scraping when the website structure changes frequently?
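On the API point above, here is a minimal sketch of what that could look like. The endpoint URL and the JSON shape (an "articles" list whose items have a "title" field) are assumptions for illustration, not something a real site is guaranteed to expose; check the network tab in your browser's developer tools to see what the site actually returns. It uses only the standard library so it runs without extra installs, though requests would work just as well.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint; replace with the one the site actually calls.
API_URL = "https://example.com/api/headlines"


def extract_headlines(payload):
    """Pull non-empty titles out of a decoded JSON payload.

    Assumes a shape like {"articles": [{"title": "..."}, ...]}.
    """
    titles = []
    for article in payload.get("articles", []):
        # Treat a missing or null title as empty so it gets skipped
        title = (article.get("title") or "").strip()
        if title:
            titles.append(title)
    return titles


def fetch_headlines(api_url=API_URL):
    """Fetch the JSON endpoint and return its headlines."""
    request = Request(api_url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        payload = json.load(response)
    return extract_headlines(payload)
```

Keeping the parsing in its own function means that if the JSON layout changes, only extract_headlines needs updating, and it can be checked against a saved sample response without hitting the network.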