Best Practices For Scraping News Articles With Python
Web scraping is an automated way of quickly collecting vast amounts of data from public websites. You may have various reasons for scraping the web, such as gathering competitor information, prices, or public sentiment.
Scraping news articles is one way to get valuable information that is up to date, and Python makes this even easier. Here’s how to start web scraping news articles with Python and how Rayobyte’s proxies can help.
A Primer for Scraping News Articles
The internet is full of public websites with potentially valuable information for many businesses, organizations, and even private individuals. This public information might reveal trends in prices, shipping costs, keyword usage, or product reviews. Public websites such as social media platforms host comments and conversations between users that can convey sentiment toward brands and products, among other information.
News articles convey current information, typically for a general audience. They contain a wealth of verified, up-to-date information you can scrape and use for various purposes. Information from the media spreads much farther than its source, especially if it goes viral (known as the multiplier effect).
Consider also the difference between news articles with an immediate, short-term impact and stories that persist longer in the news cycle. The value of each type of data depends on the business collecting it. Longer-lasting news can be more evergreen and may stay relevant all year or into the future, whereas shorter-term news may be more useful for social media marketing or viral content marketing that capitalizes on up-to-the-minute news.
There are different kinds of data in each type of news article. Businesses and organizations can take advantage of the expanded reach and influence of the media and the viral nature of this content by scraping news articles and collecting this data for their use.
The following is just some of the information you can gather when scraping news articles:
- Press releases from businesses about important, relevant topics
- Reviews of newly launched products or services
- News media sharing viral content about brands or products
- Current events that could impact production or distribution channels
- Stock price fluctuations
- Announcements from businesses or governments that could affect the prices of raw materials or other goods and services
- Court rulings or laws that could affect compliance issues
- Public sentiment toward products, brands, services, current events, or social issues that could influence product lines or business strategy
- Company reputation status
- Industry trends
Much of this data can be found on news websites and any webpage that informs readers about current events or recent news. You could look at the sites manually or have an employee fill spreadsheets with various data sets, but that would take a tremendous amount of time and effort, especially for the vast amounts of data you may need to collect.
All the data on current events, products and their respective prices, product descriptions, and reviews could fill thousands of rows on a spreadsheet. But you can automate data collection by web scraping news articles and get the data faster. Scraping news articles and other sources of information can be done with a programming language such as Python.
What is the Python programming language?
Python was first released in 1991 as a high-level scripting language designed to be easy and intuitive to use while remaining as powerful as competing languages.
Python is a language that is:
- Multipurpose
- Multiplatform
- High level
- Object-oriented
- Widely used
- Easy to obtain and install
- Free
As a result, Python is one of the most popular and in-demand programming languages. It has a vibrant community that can be helpful for people new to the language or running programs. There are numerous tutorials and videos for almost any Python task, including scraping news articles.
Another reason for Python’s popularity is its extensive libraries. These libraries make advanced tasks, such as scraping news articles, easy and much less labor-intensive than manual methods. The libraries contain many lines of code, so you have much of the work done for you when writing a new program. Some of those libraries make it possible to scrape news articles and save the data to local files for further analysis, as you’ll see later.
How to write and run Python programs
Before you can get started scraping news headlines and articles, you need to learn how to run Python code for web scraping.
To get started using Python, download and install the latest Python version. Using an integrated development environment (IDE) to write and run your code is recommended. One such IDE is Visual Studio Code, which allows you to write, debug, and run code on your local machine or in a web browser.
Once you’ve installed an IDE, you can write your Python programs. Learning to write a basic Python program is relatively easy compared to other languages.
For example, this is the code to print “Hello, world!” to the output when you run or debug the program:
# This program will print "Hello, world!" to the output in debugging
print('Hello, world!')
As you can see in the code above, the language is simple. In Visual Studio Code, run the program by clicking the start button; the text “Hello, world!” is displayed in the debug console or the terminal.
Other ways to write and run Python code are the Python shell and the Python Integrated Development and Learning Environment (IDLE). In either shell, type the code and press the Return key to run it. The text “Hello, world!” appears directly below.
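In an interactive shell session, that exchange looks like this:
>>> print('Hello, world!')
Hello, world!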
Web Scraping News Articles in Python
Scraping news articles can provide valuable data for companies and organizations, but, as mentioned, this can take a lot of time to do manually. This is why businesses use programs written in Python to collect, save, and analyze data from news sites automatically.
Web scraping news articles and other websites requires more complex code than a simple “print” command. But thanks to web scraping libraries such as BeautifulSoup, Requests, Selenium, and others, programs for web scraping are easier to write. These libraries provide ready-made code that helps you connect to publicly available websites and scrape and download data automatically.
With the Scrapy library, you can create, operate, and deploy web scrapers in the cloud. These scrapers collect website data for you by sending requests to URLs you’ve defined in the program. The scraper then loops through the data elements on the pages you want using a CSS selector.
Because it can process asynchronous requests, Scrapy is very quick. It’s also collaborative and open-source, making it a good option for a web scraping library, especially for someone without much programming experience. Other libraries, such as Requests, could be just as simple and efficient.
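To give a feel for Scrapy, here’s a minimal spider sketch. The spider name, start URL, and CSS selector are illustrative placeholders to adapt to the site you’re scraping:
import scrapy

class HeadlineSpider(scrapy.Spider):
    name = 'headlines'  # Placeholder spider name
    start_urls = ['https://lite.cnn.com/']  # Placeholder start URL

    def parse(self, response):
        # Loop through matching elements with a CSS selector
        for card in response.css('.card--lite'):
            yield {'headline': ''.join(card.css('::text').getall()).strip()}
Saved as headline_spider.py, this could be run with scrapy runspider headline_spider.py -o headlines.csv to write the results straight to a CSV file.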
After identifying the type of data you want to scrape, you’ll need to run a Python program to scrape and save the data. These are the steps for scraping the web with Python (a condensed sketch follows the list):
- Download and install Python.
- Open your IDE.
- Import a library such as Scrapy or Requests.
- Select a headless browser to open the web pages without a graphical user interface.
- Define objects such as a page source object and results object.
- Run the page source object through the web scraper class.
- Extract the data from the web pages.
- Export the data into a CSV file or a database.
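As a condensed sketch of those steps, here’s what they might look like using Requests and BeautifulSoup, skipping the headless browser step (static pages don’t need one). The URL and CSS selector are placeholders to swap for your target site:
import csv
import requests
from bs4 import BeautifulSoup

# Placeholder URL and selector -- replace with your target site and element
page = requests.get('https://example.com/news')
soup = BeautifulSoup(page.text, 'lxml')

with open('results.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for item in soup.select('.headline'):
        writer.writerow([item.get_text(strip=True)])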
Using a library for scraping news articles
To scrape news articles, you need to have a program download the website and parse the HTML. In Python, you can take advantage of a library such as Scrapy, BeautifulSoup, Lxml, Pandas, or Requests. Here’s how to use some of these libraries to scrape news articles and save data to local text files or CSV spreadsheet files.
Install Python and libraries
The first step in scraping news articles is to install the latest version of Python and the pip package installer, as well as a library or libraries to use for web scraping.
Once you have Python installed, open a Command Prompt. If using Windows 10, change the directory to your Python location, and install the Scrapy library by entering the following command:
pip install scrapy
If you are going to use Requests, enter the following command in the Command Prompt:
pip install requests
You can also install multiple libraries at once by adding them with a space in between each library:
pip install beautifulsoup4 lxml selenium pandas
You may want to install all of the above libraries, as you can use more than one library in different parts of the program to complete different tasks.
Make a new file in an IDE and import libraries
Open your desired IDE and make a new file. The example code below takes advantage of the terminal in Visual Studio Code. At the top of the new file, import the web scraping package you’ve installed:
import requests
Write code to connect to news websites to scrape
For scraping news articles, you can create a response object to browse a news site. For example, you may want to browse the text-only version of CNN and save the HTML text data of that page. This guide uses the lite version of CNN for simplicity’s sake, but you can get similar results from Google News web scraping or other news aggregators.
response = requests.get('https://lite.cnn.com/')
You can print the status code to the terminal to see if the page was downloaded successfully. A code of 200 indicates that the website’s contents were successfully downloaded.
print(response.status_code)
Print the entire text contents of the response object:
print(response.text)
So the code up to this point to import the Requests package, create a response object that connects to a news website, and print the text is the following:
import requests
response = requests.get('https://lite.cnn.com/')
print(response.status_code)
print(response.text)
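Optionally, instead of checking the status code by hand, Requests can raise an exception when a request fails, which stops the program before it tries to parse an error page:
response.raise_for_status()  # Raises an HTTPError for 4xx/5xx responses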
Parse the HTML into a Python object
If you’ve run the code above in Visual Studio Code, you should see the HTML text, including HTML tags, in your terminal. There isn’t much you can do with that data at this point without parsing the HTML into a Python object.
For that, you can use BeautifulSoup and lxml. Be sure you have installed these libraries using the pip command in a command prompt.
Import BeautifulSoup and create response objects and soup objects:
from bs4 import BeautifulSoup
response = requests.get('https://lite.cnn.com/')
soup = BeautifulSoup(response.text, 'lxml')
If you wanted to find an HTML tag on the site, you would use a find() command. For example, you may want to find and print the title of the page. Create a title object. Then, print the text in the title tag using the get_text() method.
title = soup.find('title')
print(title.get_text())
Running the above code should result in a text output in the terminal:
Breaking News, Latest News and Videos | CNN
Find and print headlines
On the CNN page, look in the HTML tags to find the class that they’ve assigned for headlines. In this case, the name of the class is “card--lite.” To get and print a list of the headlines on the page, you’ll need to find each item with the class “card--lite” and create a loop that prints each headline.
soup.find('small', class_='card--lite')  # Finds one headline; use class_ because "class" is a reserved word in Python
headlines = soup.find_all(class_='card--lite')  # Use soup.find_all() to find every matching item
Create a for loop that prints each headline:
for headline in headlines:
    print(headline.get_text())
The full code for scraping news headlines in Python is:
import requests
from bs4 import BeautifulSoup
response = requests.get('https://lite.cnn.com/')
soup = BeautifulSoup(response.text, 'lxml')
Then find all the headlines with the class “card--lite” and loop through them:
headlines = soup.find_all(class_='card--lite')
for headline in headlines:
    print(headline.get_text())
Save headlines to a text file
After you run the above code in Visual Studio Code, it outputs a list of headlines in the terminal. To save the headlines to a text file, open a file, such as headlines.txt, and write the text of the headline elements like so:
f = open("Path/to/headlines.txt", "w")
for headline in headlines:
    f.write(headline.get_text() + "\n")  # Add a newline so each headline sits on its own line
f.close()
Put the path to your text file. If a file does not exist, one will be created.
Save headlines to a CSV spreadsheet file
You may want your data saved to a CSV file that you can open with a spreadsheet program. For this, import Python’s built-in csv module and collect the headlines as before.
import requests
import csv
from bs4 import BeautifulSoup
response = requests.get('https://lite.cnn.com/')
soup = BeautifulSoup(response.text, 'lxml')
headlines = soup.find_all(class_='card--lite')
Specify the CSV file path. If a file does not exist, it will be created:
csv_file_path = "C:/Path/To/headlines.csv"
Open the file in write mode:
with open(csv_file_path, 'w', newline='', encoding='utf-8-sig') as csvfile:
    csv_writer = csv.writer(csvfile)  # Create a CSV writer object
    for headline in headlines:
        csv_writer.writerow([headline.get_text()])
The full Python program above uses several libraries, functions, and modules to scrape a news site for headlines and save the headlines to separate rows in a CSV file. You can write similar programs to scrape different URL elements and save the data to CSV files for further analysis.
Once you’ve mastered scraping news articles for headlines and saving the data, you can start scraping other HTML elements of news sites, including the article text. You can use programs to collect and sort the content and scan for keywords that may be relevant, such as product information, legal issues, or customer sentiment.
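As a sketch of that next step, the code below follows the first headline’s link and pulls the article’s paragraph text. It assumes each headline card wraps a relative link and that the article body sits in <p> tags; confirm the actual markup in the page source before relying on these selectors:
import requests
from bs4 import BeautifulSoup

base = 'https://lite.cnn.com'
response = requests.get(base + '/')
soup = BeautifulSoup(response.text, 'lxml')

# Assumes the headline card contains an <a> tag with a relative href
first_link = soup.find(class_='card--lite').find('a')['href']

article = requests.get(base + first_link)
article_soup = BeautifulSoup(article.text, 'lxml')

# Assumes the article text lives in <p> tags -- check the real markup
text = '\n'.join(p.get_text() for p in article_soup.find_all('p'))
print(text[:500])  # Preview the first 500 characters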
Choose the right web browser for quicker news article scraping
Browsers can be run “headless,” meaning they run in the background without a graphical user interface, which makes pages load much quicker. You won’t see the visual contents of the websites when using a headless browser. You can use Selenium to control a browser through a web driver, and you may also need to download a driver for each browser.
For example, you can download a Chrome driver. Import the web driver from Selenium and create a driver object:
from selenium import webdriver
driver = webdriver.Chrome()
Put the address of the site you want to scrape:
driver.get('https://lite.cnn.com/')
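To actually run Chrome headless, pass the headless flag through ChromeOptions. A minimal sketch (recent Selenium versions fetch a matching driver automatically; older setups need a ChromeDriver binary on your PATH):
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless=new')  # Use '--headless' on older Chrome versions

driver = webdriver.Chrome(options=options)
driver.get('https://lite.cnn.com/')
print(driver.title)  # The page loads with no visible browser window
driver.quit()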
Using a proxy when scraping news articles
Websites can detect when you are web scraping news articles and other sites and may block your internet protocol (IP) address if you send many requests from a single IP address in a short period. One possible workaround is using a proxy to scrape websites. Proxies hide your real IP address and switch the IP address your scraper uses, so it appears multiple users are making requests.
Proxy servers significantly reduce the possibility of websites blocking you when scraping news articles, especially when you are scraping a lot of data at once. So you can scrape information without worrying about making too many requests or being detected as a bot. When scraping news articles, proxies give you a layer of anonymity, and it’s also possible to bypass censorship and filters.
You can use Python Requests to hide your IP address behind a proxy when scraping news articles. Python Requests allows programmers to send automatic HTTP requests to websites to gather information from the HTML or XML, including page contents, cookies, headers, and other information.
To use a proxy with Requests when scraping news articles, you would:
- Purchase a proxy that has an HTTP address.
- Install the latest version of Python.
- Import Requests.
- Put your proxy server’s address in your code.
- Put your username and password in the code.
- Run your web scraping program.
A proxy code snippet may look something like this, with placeholder addresses you’d replace with your own proxy details:
import requests

proxies = {
    # The addresses of your proxy -- placeholder values, replace with your own
    'http': 'http://username:password@proxy-address:port',
    'https': 'http://username:password@proxy-address:port',
}
session = requests.Session()
session.proxies.update(proxies)
For scraping news articles, put the news website you are going to scrape in the session.get() code:
session.get('http://WebsiteToBeScraped.com')
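To confirm that requests are actually routed through the proxy, you could hit a service that echoes the requesting IP address, such as httpbin:
print(session.get('https://httpbin.org/ip').text)  # Should show the proxy's IP, not yours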
To get proxies to help with scraping news articles, partner with a well-known and trusted proxy distributor such as Rayobyte.
Choose from a range of proxy products, each with different advantages:
- Dedicated data center IPs offer unlimited bandwidth.
- ISP proxies are lightning-fast and get banned less often than rotating data center proxies and other data center proxies.
- Ethically sourced residential proxies hide your IP behind a residential IP.
- Mobile proxies ensure continuous internet connectivity and seamless browsing with the highest success rate of all available proxies.
You can also use Scraping Robot, a web scraping API that takes care of the entire web scraping process without any hassles, including proxy management. Choose the products that will help you the most when scraping news articles, and always feel free to upgrade when your business expands.
How Rayobyte Can Help You Get Started Web Scraping News Articles
Scraping news articles can provide a treasure trove of data that all types of businesses, organizations, and individuals may want to collect and analyze to make informed and profitable decisions. Scraping news articles with Python is a beginner-friendly method that can do most of the hard work for you, thanks to libraries.
When scraping news articles, be sure to protect your anonymity and security to avoid getting blocked. For that, you’ll want to use the best proxies on the market, which will rotate your IP address so you can keep scraping without interruption. If you have more questions about web scraping news articles and using proxies, Rayobyte is here to help.