Python Web Scraping Examples
You want to build a web scraper to capture valuable information from various websites. You know the value of it. But how do you do it? Python is one of the best programming languages for building a web scraper, and it is easy to learn even for those without much coding experience. Still, a Python scrape website example can help you understand the details and processes involved.
In this Python scraper example, we will show you how simple and powerful Python can be at extracting valuable data from websites. We will help you learn to do this using the requests library, fetching HTML content from the website you select, and then parsing that information using Beautiful Soup. This allows you to capture the very specific information you need to research.
We will also provide you with some strategies for using Selenium for dynamic content and how to handle factors like headers, pagination, and storing your data in a database or CSV. Ready to learn?
Here’s Where to Start First: Get Up to Speed
There is a lot to learn through this process, so we encourage you to explore some of our other resources that can help you get up to speed on the details. We encourage you to read our “How to Scrape the Web Using Python Requests” tutorial as a first step. You can also learn more in-depth strategies in Advanced Web Scraping in Python.
We offer a range of tools to help you. Additionally, it is well worth getting started now with using a proxy for web scraping. Proxies are a key component of this process because they protect your identity while you are navigating the internet. We encourage you to set up a proxy for web scraping now and then maintain it throughout the process. For guidance, see our Proxies and Python Web Scraping (Why a Proxy Is Required) guide.
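As a preview of how that fits together, here is a minimal sketch of routing requests through a proxy with the requests library. The proxy host, port, and credentials below are placeholders, not real values:

import requests

# Hypothetical proxy endpoint; substitute your own host, port, and credentials.
proxies = {
    'http': 'http://username:password@proxy.example.com:8080',
    'https': 'http://username:password@proxy.example.com:8080',
}

# requests routes the request through the proxy passed via the proxies argument.
response = requests.get('https://rayobyte.com/', proxies=proxies)
print(response.status_code)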
Python Scraper Example: Where to Get Started
Before we provide you with the Python web scraper example you need, here are some basics to know. The process for building a web scraper includes several steps:
- Ensure you have the most up-to-date version of Python downloaded, installed, and ready to go.
- You will also need several Python libraries installed and ready to go. These libraries provide the tools you need to build your web scraper. Read our articles on each of them. Install Beautiful Soup, lxml, Selenium, and requests to get started.
- Find the HTML elements that contain the information you need.
- Save the scraped data to a database or CSV.
Now, let’s build out a Python scrape website example to show you how to create your own scraper.
Requests library: You need to send HTTP requests to the website to capture the information you need. To do that, you use GET and POST requests, whose responses typically contain the data you are after. The requests library makes this possible. Start by opening the terminal and running the following command:
python -m pip install requests
You can then use GET and POST requests within your code to communicate with the website. Here is a Python screen scraping example for requests:
import requests

response = requests.get('https://Rayobyte.com/')
print(response.text)
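The intro also mentioned handling headers. Here is a hedged sketch of a GET request that sends a custom User-Agent header, plus a POST that submits form data; the header string, endpoint, and payload are illustrative assumptions, not values from this tutorial:

import requests

# Many sites inspect the User-Agent header, so scrapers often send a browser-like one.
headers = {'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)'}
response = requests.get('https://rayobyte.com/', headers=headers)
print(response.status_code)

# A POST request submits data to the server; this endpoint and payload are hypothetical.
payload = {'q': 'proxies'}
response = requests.post('https://example.com/search', data=payload)
print(response.status_code)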
Beautiful Soup: The next step in this web scraping with Python example is to use Beautiful Soup, another library, along with a parser to extract the data you need from HTML. It can also turn invalid markup into a parse tree. Start by installing Beautiful Soup with this command:
pip install beautifulsoup4
You will need a parser to help facilitate this process. In the following examples, we use html.parser, which is part of the Python standard library, to parse information.
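To see what “turning invalid markup into a parse tree” means in practice, here is a tiny sketch using deliberately broken HTML; the snippet itself is an illustrative assumption:

from bs4 import BeautifulSoup

# Invalid markup: the <p> tag is never closed.
broken_html = '<html><body><p>Hello, world</body></html>'
soup = BeautifulSoup(broken_html, 'html.parser')
print(soup.prettify())  # the parse tree closes the tag automatically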
To get HTML using requests, you will use this type of code:
import requests

url = 'https://Rayobyte.com/blog'
response = requests.get(url)
Now, let’s say we want to find the <title> element – the target of this search. You can use this web scraper Python example to find it:
import requests
from bs4 import BeautifulSoup

url = 'https://Rayobyte.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title)
If you have followed that Python scrape website example, you should get the following title:
<title>Rayobyte Blog | Rayobyte</title>
You can then modify the requests you send to include the specific information you need. Use the find_all() method to narrow down the information, as shown below. You can also use the more advanced tools available to help you navigate each of the requests you need to send.
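For example, here is a minimal find_all() sketch that collects headings from our blog; the assumption that titles sit in <h2> tags is illustrative, so adjust the tag (or pass class_=...) to match the actual markup:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://rayobyte.com/blog')
soup = BeautifulSoup(response.text, 'html.parser')

# find_all() returns every matching tag; <h2> is an assumed location for titles.
for heading in soup.find_all('h2'):
    print(heading.get_text(strip=True))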
LXML: The next component of this web scraper Python example is to focus on parsing. Parsing is critical for web scraping, but it is not easy to do – by any means. With lxml, you have a very powerful and easy-to-use tool that will parse the information you need. It works with both HTML and XML files and can extract the information necessary from huge amounts of data or datasets.
There are a few factors to keep in mind here. For example, the quality of a page’s HTML affects how well lxml can parse it, so you will need to use the tool properly to achieve your objectives. To get started, install lxml with this command:
pip install lxml
This includes the html module, which works with HTML (of course!). You will need to have an HTML string first, though, which you can fetch with the requests library. This is where it gets a bit confusing, but when you pull it all together, such as in this Python web scraping example, it really becomes a fast and efficient process.
Here’s what you need to do now. Once you have the HTML available, build the tree using the fromstring function. To do that, use this code:
import requests
from lxml import html

url = 'https://rayobyte.com/blog'
response = requests.get(url)
tree = html.fromstring(response.text)
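Once the tree is built, you can query it with XPath expressions. Here is a minimal sketch that pulls heading text; the //h2 expression is an assumption about the page’s markup, so adjust it to fit the site you are scraping:

import requests
from lxml import html

response = requests.get('https://rayobyte.com/blog')
tree = html.fromstring(response.text)

# Extract the text of every <h2> element; the tag choice is an assumption.
for title in tree.xpath('//h2/text()'):
    print(title.strip())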
You will find a lot of benefits in using this tool, but there are a few more steps that can significantly enhance the outcome of your project.
Selenium: Dynamic pages can be one of the most important factors to consider when it comes to web scraping. Many of today’s websites contain dynamic content (even our website does!). That helps make the content more engaging and beneficial to users. However, it makes web scraping more challenging. The next part of this Python scrape website example is to use Selenium, another Python library, to handle dynamic content.
Selenium is an open-source browser automation tool that will automate a variety of the tasks necessary to navigate dynamic pages. That includes logging into websites or capturing information to answer questions. It is also one of the best ways to avoid CAPTCHA, those difficult boxes and tests that aim to prevent you from getting beyond the page with a bot.
Install it using this command:
pip install selenium
You will then need to use one of these Python scrape website examples to configure Selenium for the browser you are using. Most commonly, people use Chrome, so here is the example code for Chrome:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
We can now use the driver’s get() method to navigate these pages. Here is an example:
driver.get('https://rayobyte.com/blog')
Now, here is another web scraping Python example to try out. Let’s say you are going to use Selenium with CSS selectors and XPath to extract elements from a page. Our objective in this example is to capture all of the titles on our blog – have you read them all? Certainly worth it!
Using a CSS selector, we could use this code:
blog_titles = driver.find_elements(By.CSS_SELECTOR, 'a.e1dscegp1')
for title in blog_titles:
    print(title.text)
driver.quit()  # closing the browser
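If you prefer XPath, the same extraction looks like this, continuing with the driver created above (the class name is the same assumption as in the CSS example):

# Same titles, located with an XPath expression instead of a CSS selector.
blog_titles = driver.find_elements(By.XPATH, "//a[contains(@class, 'e1dscegp1')]")
for title in blog_titles:
    print(title.text)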
One caveat about using Selenium for this type of project is that it will slow down the process. That is because the browser has to execute the JavaScript on every page before the scraper can parse the content.
If you are trying to navigate a huge amount of data, then you may find this to be a bit slow. For most other web scraping projects, though, Selenium is all you need.
Breaking Down the Tools to Facilitate Web Scraping in Python
Now that we have the basics and Python scrape website examples, we can take a closer look at some of the core details that you need to really produce the results you need.
Handling Pagination and Dynamic Content: One of the struggles that many have when creating a web scraper is being able to navigate dynamic content. We have already mentioned this a bit, but let’s talk about other tasks. For example, what happens when you need multiple pages of data – and not just a simple website URL?
You will be able to use Selenium to help you overcome this for pagination. Use Selenium for web scraping involving the following (see the sketch after this list):
- Delayed content, such as data that takes a few seconds to show up on the page before it actually displays
- JavaScript websites, including any website that is heavily reliant on JavaScript
- JavaScript blocks, which some sites use.
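Here is a minimal pagination sketch under stated assumptions: the pager is a link whose visible text is “Next” (hypothetical), and WebDriverWait handles content that takes a few seconds to appear:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get('https://rayobyte.com/blog')

while True:
    # Wait up to 10 seconds for delayed content to render.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, 'a'))
    )
    # ... extract data from the current page here ...
    try:
        # Hypothetical pager: a link whose visible text is "Next".
        driver.find_element(By.LINK_TEXT, 'Next').click()
    except NoSuchElementException:
        break  # no more pages

driver.quit()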
How to Pick a URL: Another component of this process is selecting a URL. In our web scraping examples in Python, we have provided a range of specific URLs to our blog, but there are a few key factors to remember when choosing any URL to include:
- Watch out for hidden JavaScript elements. If a page builds its content with JavaScript, the simple methods we are providing here may not work the way you need them to.
- Image scraping requires more extensive processes. You can find an example in our Selenium guide. Web scraping images with Python takes a bit more detail to make it a successful process for you.
- Make sure you follow the rules. Our guide here and any other on this website is meant to provide you with the tools you need to scrape content from the web in an ethical manner. You should use it only for public data, and you should never overstep on third-party rights. Be sure to read the terms and conditions of the website you are using.
Exporting Your Data
Now that you have worked through most of the tasks necessary for web scraping with Python, our next objective is to do something with that data. The best option is to export the data to a CSV file.
Before moving on to that process, it is important to check your data at this point. You want to make sure the data is assigned to the right object so that it moves properly. Use the print() function for this:
for x in results:
    print(x)

And then:

print(results)

Now, when you remove the print loop, you will be able to move the data to the CSV file. You can see how to do that here (note that pandas must be imported first):

import pandas as pd

df = pd.DataFrame({'Names': results})
df.to_csv('names.csv', index=False, encoding='utf-8')
Try Out This Python Scrape Website Example
The following builds on all of the web scraping examples in Python we have used (and adds a few more specific elements to the process that you can learn more about in our advanced tutorials). Update it to the specs you need to build the type of scraper you desire.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver import ChromeOptions
import pandas as pd

# Generate 5 URLs of search results.
pages = ['https://sandbox.rayobyte.com/products?page=' + str(i) for i in range(1, 6)]

# Crawl all URLs and extract each product's URL.
product_urls = []
for page in pages:
    print(f'Crawling page \033[38;5;120m{page}\033[0m')
    response = requests.get(page)
    soup = BeautifulSoup(response.text, 'lxml')
    for product in soup.select('.product-card'):
        href = product.find('a').get('href')
        product_urls.append('https://sandbox.rayobyte.com' + href)

print(f'\nFound \033[38;5;229m{len(product_urls)}\033[0m product URLs.')

# Initialize a Chrome browser without its GUI.
options = ChromeOptions()
options.add_argument('--headless=new')
driver = webdriver.Chrome(options=options)

# Scrape all product URLs and parse each product's data.
products = []
for i, url in enumerate(product_urls, 1):
    print(f'Scraping URL \033[1;34m{i}\033[0m/{len(product_urls)}.', end='\r')
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    info = soup.select_one('.brand-wrapper')
    product_data = {
        'Title': soup.find('h2').get_text(),
        'Price': soup.select_one('.price').get_text(),
        'Availability': soup.select_one('.availability').get_text(),
        'Stars': len(soup.select('.star-rating > svg')),
        'Description': soup.select_one('.description').get_text(),
        'Genres': ', '.join([genre.get_text().strip() for genre in soup.select('.genre')]),
        'Developer': info.select_one('.brand.developer').get_text().replace('Developer:', '').strip() if info else None,
        'Platform': info.select_one('.game-platform').get_text() if info and info.select_one('.game-platform') else None,
        'Type': info.select('span')[-1].get_text().replace('Type:', '').strip() if info else None
    }
    # Append each product's data to a list.
    products.append(product_data)

driver.quit()

# Save results to a CSV file.
df = pd.DataFrame(products)
df.to_csv('products.csv', index=False, encoding='utf-8')
print('\n\n\033[32mDone!\033[0m Products saved to a CSV file.')
Ready to Get Started?
To help you get started, learn more about proxies for web scraping (we strongly recommend this step). At Rayobyte, we aim to educate you about all of your options. Use this Python scrape website example to help you get started building your own.
The information contained within this article, including information posted by official staff, guest-submitted material, message board postings, or other third-party material is presented solely for the purposes of education and furtherance of the knowledge of the reader. All trademarks used in this publication are hereby acknowledged as the property of their respective owners.