Proxies And Python Web Scraping (Why A Proxy Is Required)
If you’ve been web scraping for a while, you already understand how valuable the practice is. Web scraping is simply the best way to collect vast amounts of data from websites in virtually no time. Still, not all web scraping programs are created equal. If you want to produce a web scraper that’s genuinely efficient and effective, sometimes the best strategy is to write the program yourself.
That’s where Python comes in. Python web scraping is fast, efficient, and easy to manage. As long as you have a baseline familiarity with Python as a language, you can use Python to get data from websites in minutes.
This is the ultimate guide on how to scrape data from a website with Python. That means there’s a ton of information here, from the basics to potential problems you may encounter. If you need to skip around to find specific information, you can check out the table of contents below.
Why Scrape Websites with Python?
There are dozens of programming languages that support web scraping programs. So why use Python to scrape web pages? Python is one of the most robust, well-rounded, general-purpose programming languages in the world, and it consistently ranks among the most popular languages globally. Here’s what you need to know about Python and how it compares to alternative scraping languages.
What is Python?
According to the official Python website, “Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together.”
Basically, Python is a language that prioritizes readable object-oriented code. That makes it the perfect choice for web and app development.
Since web scrapers are applications designed to be used online, Python is a natural fit. Other benefits of Python include:
- Ease of use: Python code is free of the semicolons and curly braces that clutter other languages, and you don’t need to declare datatypes for variables. It’s quick and easy to write.
- Easy to learn: It’s comparatively easy to learn Python. Anyone with a basic understanding of programming can quickly put together functional programs.
- Large libraries: Python’s popularity means that there are a large number of high-quality libraries that you can implement. You don’t need to write everything from scratch.
Despite the many benefits of Python, people do choose to use other languages for web scrapers. Below are comparisons of Python with C# and Java, the other two commonly-used scraping languages. We’ll show how they stack up in different situations.
Python vs. C#
Python benefits:
- User-friendly
- Line-by-line execution
- Extensive library support
- Dynamically-typed
- Easy to learn and write
Python drawbacks:
- No compile-time type or memory safety checks
- Slower than C#
C# benefits:
- Very fast
- Built-in type and memory safety
- Extensive library support
C# drawbacks:
- Expensive
- Resource-intensive
- Complicated to learn and write
- Statically-typed
C# is a programming language built by Microsoft and designed to support modular programs. C# is descended from the languages C and C++. These languages are incredibly well-known, so there are many programmers already familiar with the code family. However, C# is much less intuitive than Python, so it’s not an excellent choice for anyone who isn’t already familiar with it.
C# can handle massive data loads, which Python can occasionally struggle to accomplish. On the other hand, C# is expensive to implement and requires significant server resources to run, while Python does not. Finally, C# has historically been tied to Microsoft’s proprietary .NET Framework; modern .NET runs cross-platform, but the ecosystem remains Microsoft-centric.
Verdict: If you already know C# and have access to the resources of a large business, it can be a great choice. In any other situation, Python is the way to go.
Python vs. Java
Python benefits:
- User-friendly
- Supports multithreading
- Easy to learn and write
- Dynamically typed
- Supports operator overloading
Python drawbacks:
- Slower than Java
Java benefits:
- Supports multithreading
- Faster than Python
Java drawbacks:
- Statically typed
- Verbose; programs take longer to write
- Does not support operator overloading
Java was one of the first genuinely platform-independent programming languages. It’s well-known for web development, coming just behind Python in popularity. It has many similarities to C# but has the benefit of running on many platforms. It’s faster than Python but not quite as fast as C#.
However, Java is more complicated to learn and write than Python. While Java is more beginner-friendly than C#, its static typing makes it less approachable than Python, and even simple Java programs grow long quickly. Finally, while the core JDK is open-source, parts of the Java ecosystem are commercial, so libraries and tooling can carry licensing costs.
Verdict: If you’re familiar with Java and more comfortable with statically typed languages, then Java is a good choice. Otherwise, Python is still a better solution for most programmers.
How to Scrape Data from a Website with Python
To scrape websites with Python, you need to produce a program that interacts with the structure of each website’s HTML. The program will read the HTML, collect the information you need, and print it out in your preferred format.
There’s some prep work you’ll need to do before you can even consider HTML scraping with Python.
Decide what kind of information you want to collect. You can scrape a site for any kind of public-facing data or information.
- Do you want to know product names and prices?
- Do you want to collect product descriptions?
- Are you just looking for phone numbers?
Figure out the data you want to gather before you do anything else.
Inspect the pages and sites you want to scrape to understand how they’re structured. While there are plenty of common themes to website design, no two programmers will build a site the same way. Some may structure URLs with sequential digits, while others will create unique URLs for every page. That will make a significant difference in how you structure your scraper loops later.
The best way to study the site is to visit it in the browser you want to use for scraping. Check out the site with built-in browser dev tools to look for patterns you can use in your scraper. Most critically, check for patterns around the data you want to collect. This way, you can build the program to take advantage of them.
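For instance, if inspection shows that a site numbers its pages sequentially, you can generate every target URL up front. A minimal sketch (the site name and page range are hypothetical placeholders):

base_url = "https://WebsiteName.com/page/{}"
urls = [base_url.format(page) for page in range(1, 11)]   # pages 1 through 10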
Once you’ve done this preliminary work, it’s time to write your first Python web scraping program.
How to Make a Web Crawler in Python
If you want to make your own web scraper in Python, you’re in luck. Python is an easy language in which to write all kinds of programs. As long as you know the basics, you should be able to follow along with this web scraping tutorial for Python and build a basic one that fits your needs.
0. Install Python
Before you can write anything, you need to install Python on your server or computer. You can check to see if you already have Python by running the following command:
python3 --version
If you have it, the readout will tell you which version of Python is installed. Ideally, you want a recent Python 3 release, but any version from 3.4 onward will work for this tutorial.
If you use Windows, make sure you check the “Add Python to PATH” option during installation. This will make installing libraries later significantly more straightforward, because it lets you run “pip” and “python” commands from the Windows Command Prompt, which will be used throughout the tutorial.
1. Choose a coding environment
Python is platform-independent, so you can use just about any coding environment you want. If you’re new to programming, it may be a good idea to work with an Integrated Development Environment (IDE) like PyCharm or Visual Studio Code. IDEs are easy to use and make code easier to read with syntax highlighting and accessible UIs. However, if you’re an old hand at Python, you can also use a plain text editor and save your work as a .py file.
2. Install and import Python web scraping libraries
Part of the beauty of Python is the sheer number and quality of libraries you can download. These libraries will save you hundreds of hours of programming stress. The rest of this tutorial assumes that you’ll work with the following four excellent libraries:
Beautiful Soup: This library works with a parser to pick meaningful data out of messy HTML. It doesn’t make HTTP requests itself; you need to supply both the parser and the request-making library. Still, Beautiful Soup is the best solution for making sense of HTML, even invalid markup. Install the library from the terminal with the following command:
pip install beautifulsoup4
You can upgrade to the newest release at any time with “pip install --upgrade beautifulsoup4”.
lxml: Another great parsing tool is lxml. It excels at large datasets, but it’s not as forgiving of messy HTML as Beautiful Soup. Use lxml as your parsing library whenever you need to speedily scrape large, well-formed sites. You can install lxml from the terminal with this command:
pip install lxml
Requests: The Requests library is the bedrock of Python web scraping. This library is full of tools that simplify the process of making HTTP requests. It lets you send HTTP GET and POST requests with a single line of code. You can install the library with this command:
pip install requests
Selenium: The Requests library is great at static pages, but it simply doesn’t work on pages that use JavaScript to dynamically fill fields and menus. That’s where the Selenium library comes in. Selenium drives a real browser, which handles the JavaScript rendering process; once the page has rendered, the newly filled fields can be scraped with Beautiful Soup. Selenium also has some features that let you program your scraper to appear more human. You can install Selenium with:
pip install selenium
Then import the webdriver module at the top of your script:
from selenium import webdriver
Selenium also requires a browser-specific driver, such as ChromeDriver for Chrome or geckodriver for Firefox, which you can download from the driver’s official site.
Pandas: pandas is a data analysis and manipulation tool that will help your program create important variables and generate two-dimensional data tables. That’s vital if you want to produce organized data. Install it with “pip install pandas”, then add it to your script with the following:
import pandas as pd
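With everything installed, you can run a quick sanity check that the core libraries work together. A minimal sketch that fetches a page with Requests and parses it with Beautiful Soup:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")        # fetch the raw HTML
soup = BeautifulSoup(response.text, "html.parser")    # parse it into a searchable tree
print(soup.title.text)                                # prints the page's <title> text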
3. Choose a browser
There are as many browsers as there are programming languages. You may prefer Opera or another obscure browser — but they’re not the best choice for your web scraper. Instead, make sure you have a browser that’s supported by the libraries you need. The best solutions are Chrome, Firefox, and Edge. The rest of the tutorial will assume you’re using Chrome.
Once you’ve chosen a browser, it’s time to add it to your program. This makes use of Selenium:
from selenium import webdriver
driver = webdriver.Chrome(executable_path=r'c:\path\to\windows\webdriver\executable.exe')
(The r prefix marks a raw string so the Windows backslashes aren’t treated as escape characters. Newer Selenium releases replace executable_path with a Service object, but the idea is the same.)
This is also the time to enter the URL you want to scrape:
driver.get('https://WebsiteName.com/page/2')
4. Define objects and build lists
Now it’s time to create objects in your code and take advantage of Python’s dynamic typing, which lets you define objects without exact types. You need a page source object and a results list. You can set both to whatever variable names you prefer:
target = driver.page_source
results = []
Next, you can run your page source object through the Beautiful Soup class so it can be analyzed. Import the class from the bs4 package and pass it the page source along with a parser:
from bs4 import BeautifulSoup
soup = BeautifulSoup(target, 'html.parser')
5. Extract data from HTML files
Now it’s time to start collecting data from website HTML. The problem with HTML scraping is that you want to collect a small element from many different places on the page, like product names or usernames. That requires you to scrape each smaller section individually to build the list.
The best method for this is “soup.findAll” (also available as “find_all”), since you can pass it just about any argument Beautiful Soup supports. When you’re first building a web scraper, it’s easiest to start searching by attributes, which lets you target classes on elements like titles and headers.
How do we find them? Simple:
for element in soup.findAll(attrs={'class': 'header1'}):
This will find all HTML classes that are identified as header1. Now you want the program to do something with what it’s found.
    results.append(element.text)
Now you’ve built a “for” loop! The loop will spot every header1 on the page and append its text to your list. Make sure you always indent the lines inside a loop or you’ll run into errors.
6. Export data
Now it’s finally time to use the pandas library. It will neatly export your data to a .csv file with the following code:
df = pd.DataFrame({'header1': results})
df.to_csv('headers.csv', index=False, encoding='utf-8')
The first line creates the variable “df” and assigns it a 2D table (a DataFrame). The column will be titled “header1,” and the scraped results will be listed underneath it. The second line exports the entire table into a .csv file with appropriate encoding.
The complete program
That’s a complete web scraper! It doesn’t take long at all — which is the beauty of Python. In just a few short lines, you can install and write all the commands you need to print out your information in a single convenient file.
If you’ve never written a program for Python web scraping before, now you understand the basic shape:
pip install beautifulsoup4
pip install selenium
pip install lxml
pip install pandas
pip install requests
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
driver = webdriver.Chrome(executable_path='/nix/path/to/webdriver/executable')
driver.get('https://WebsiteName.com/page/2')
results = []
target = driver.page_source
soup = BeautifulSoup(target, 'html.parser')
for element in soup.findAll(attrs={'class': 'header1'}):
    results.append(element.text)
df = pd.DataFrame({'header1': results})
df.to_csv('headers.csv', index=False, encoding='utf-8')
You can explore and play around with different additions to your scraper. The libraries this tutorial uses can do so much more.
Python Web Scraping Examples
You can use a program like this to accomplish all sorts of scraping tasks. Scraping can be used by businesses and hobbyists alike, so it’s perfect no matter what your data-gathering needs are.
For example, businesses can — and do — use Python web scrapers to collect pricing information from their competitors. These scrapes collect elements like product titles, descriptions, and prices into a single massive .csv file. Companies can analyze a single scrape session to set new prices. They can also perform regular scrapes to check for competitors’ sales and potentially compete with them.
On the other hand, an individual could use a Python scraper to search for sales. If you’re looking for a good deal on something, you could scrape auction sites and retail stores to find the best price. You’d need to look for product names, prices, and shipping details to find the best bargain for your money.
Other scraping examples include:
- Real estate data: You can collect home descriptions, prices, locations, and more to determine property values, good deals, or rental prices.
- Hotel and travel information: You can scrape airline and hotel sites to find open dates, cheap travel times, and other information so you can have the best solution for your trips.
- Social media: You can scrape social media sites for references to brands, concerts, follower counts, comments, and anything else you might find interesting.
- Stock data: If you’re interested in comparing data about specific stocks, you can scrape stock sites for relevant information on a minute-by-minute basis.
Python Web Scraping Best Practices and Tactics
Once you understand the basics of Python web scraping, you can implement some best practices and tactics to make your program better. The following tips are expert-level solutions that will help you get higher quality data in less time.
Scrape multiple URLs at once
Scraping a single URL is like using a bazooka to kill a fly. You have incredible power at the tips of your fingers, so use it well. You can program your web scraper to scrape multiple URLs at once with a simple loop.
You can get a lot of mileage out of a “for” and “while True” loop on many simple sites. Many sites use numbers at the end of the URL to count up different pages. One page might be “www.website.com/page/12,” and the next is “www.website.com/page/13”.
To take advantage of this, pair the Requests library with Beautiful Soup in a loop. You can instruct the program to continue looping until there’s no next page to visit.
The code might look like this:
import requests
from bs4 import BeautifulSoup

url = 'https://WebsiteName.com/page/1'   # starting page
page = 1

while True:
    print('---', page, '---')
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    for link in soup.find_all("a"):
        print("<a href='%s'>%s</a>" % (link.get("href"), link.text))
    general_data = soup.find_all('div', {'class': 'title'})
    for item in general_data:
        print(item.contents[0].text)
        print(item.contents[1].text.replace('.', ''))
        print(item.contents[2].text)
    next_page = soup.find('a', {'class': 'next'})
    if next_page:
        url = next_page.get('href')
        page += 1
    else:
        break  # exit `while True`
Implement headless browsers
A headless browser is a browser that doesn’t display a graphical interface. The browser is running, but there’s no screen or window for you to interact with. There’s no UI at all; you interact with it through a command line.
Headless browsers are faster than those that provide a UI. Without the demand of displaying the UI, the browser can load and render pages much quicker. Once you’re comfortable with your program and tired of watching it flicker through sites, you can switch to a headless setup such as Puppeteer, a Node.js library maintained by the Google Chrome team that drives headless Chrome.
You can implement Puppeteer with the line:
npm i puppeteer
Or:
yarn add puppeteer
Both will install Puppeteer and download a recent version of Chromium, which Puppeteer runs headless by default.
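If you’d rather stay in Python, Selenium can also run Chrome headless. A minimal sketch, assuming your ChromeDriver setup from step 3:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")              # no visible browser window
options.add_argument("--window-size=1920,1080") # give the page a realistic viewport
driver = webdriver.Chrome(options=options)      # driver path configured as in step 3
driver.get("https://WebsiteName.com/page/2")
print(driver.title)
driver.quit()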
Create a human scraping pattern
You want to make sure your web scraper isn’t identified as a bot. That means you need to do some disguise work. Your scraper will perform tasks in extremely similar time frames and click links in the exact same pixel unless you tell it otherwise. That’s a clear giveaway that a bot — not a person — is visiting. You can prevent that by implementing human scraping patterns within the bot.
To do this, create short, randomized wait times between pages using “import time” and “from random import randint”. That will make your scraper take a variable amount of time to visit different pages, just like a person would.
You can also adjust how your bot interacts with each page. Calling JavaScript’s “window.scrollTo()” through Selenium is a great way to make your scraper move through the page like a person visiting in a browser instead of a bot that can see the entire thing at once.
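A minimal sketch combining both ideas, assuming the `driver` object and a `urls` list from the earlier steps of this tutorial:

import time
from random import randint

for url in urls:                    # `urls` is your list of target pages
    driver.get(url)
    # scroll partway down the page, like a person skimming it
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight / 2);")
    time.sleep(randint(2, 8))       # pause a random 2-8 seconds before moving on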
Create monitoring processes
Many sites only show certain data at specific times. Others update information regularly, and you want to make sure you get the most recent data. In either situation, it’s worthwhile to set up monitoring loops to make your web scraper more accurate.
A monitoring loop is a long-lasting loop that rechecks specified URLs at set intervals. It monitors these URLs for changes and updates. You can easily set up a long-lasting loop with the Requests, time, and Beautiful Soup libraries.
All you need to do is add the following script to the bottom of your program:
import time
import requests
from bs4 import BeautifulSoup

while True:
    url = "http://Google.com/"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    if "word" in soup.get_text():   # the word is still on the page; wait and recheck
        time.sleep(60)
        continue
    else:
        break  # exit `while True`
This script will check a URL once a minute for as long as a particular word appears on the page. You can check on longer intervals by replacing the “time.sleep()” number with the number of seconds you want to wait.
Use proxies
The last element of a good web scraper is the use of proxies. A good Python proxy protects your IP address from sites that want to block bots. Many sites will ban IP addresses that send too many requests in a short period of time or that generally seem robotic. If your scraping pattern triggers this type of protection mechanism, your entire scrape can be ruined.
Luckily, the Requests library has a simple tool to add proxies to your scraper:
import requests
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
session = requests.Session()
session.proxies.update(proxies)
session.get('http://WebsiteName.com')
This code adds a single Python proxy to HTTP and HTTPS requests for an entire session. You can also implement the Proxy Pilot proxy management application. It’s just as easy to add to your web scraping program. It will allow you to manage your proxies without having to change your scraper program every single time.
The 10 Most Common Complications when HTML Scraping with Python
Every web scraper will run into problems. If you’ve written your web scraping program, you’ll probably notice some weird errors popping up in your results. Luckily, these can usually be fixed pretty easily. Here’s how you can spot and address the 10 most common problems you’ll face when web scraping.
1. Asynchronous loading
Asynchronous loading happens when a JavaScript-heavy website doesn’t present information in the HTML code as you might expect. For example, Twitter’s infinite scrolling is produced through asynchronous loading: the website makes AJAX calls after the page loads to keep adding new content as you scroll. If you’re not getting the data you expect, manually visit the site and use browser dev tools like Chrome DevTools and Firefox Developer Tools to inspect the network requests the page is making.
Once you spot asynchronous loading, you can use Selenium to launch your web scraper in a real browser instance driven by a web driver. This will be slower, but it allows your program to render JavaScript, which you can then scrape. You can also scrape the AJAX calls directly by setting an “X-Requested-With” header to mimic AJAX requests.
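For the second approach, here’s a hedged sketch. The endpoint URL is a hypothetical placeholder; copy the real one from your browser’s Network tab:

import requests

headers = {"X-Requested-With": "XMLHttpRequest"}   # mimic an AJAX request
response = requests.get("https://WebsiteName.com/api/feed?page=2", headers=headers)
data = response.json()   # AJAX endpoints often return JSON rather than HTML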
2. Authentication
Sites that require authentication force you to provide a username and password, a CSRF token, or certain header settings like a specific referrer to access the page. The best way to identify these sites is to manually log in and inspect the headers and the information being sent to the server using your built-in browser network tools. Once you know the necessary information to include, you can use Python to add it to your scraper.
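As a sketch of what that can look like, here’s a form login that carries a hidden CSRF token. The URLs and field names are hypothetical; take the real ones from your browser’s network tools:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
login_page = session.get("https://WebsiteName.com/login")
soup = BeautifulSoup(login_page.text, "html.parser")
token = soup.find("input", {"name": "csrf_token"})["value"]   # hidden CSRF field

session.post("https://WebsiteName.com/login", data={
    "username": "your_username",
    "password": "your_password",
    "csrf_token": token,
})
# the session now carries the login cookies for every later request
profile = session.get("https://WebsiteName.com/account")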
3. CAPTCHAs
A CAPTCHA will block your scraper and cause the page to time out. If you’re getting frequent timeout errors, you can manually check the page to look for a CAPTCHA. In web scraping, most CAPTCHAs are triggered by security measures that spot bots. You can avoid CAPTCHAs by using rotating residential proxies in front of your scraper.
4. Header inspections
Many sites filter out suspicious requests based on the likelihood that the user-agent header they include belongs to a bot. You can spoof the user-agent field to avoid signaling that you’re a bot, and you should rotate your IP addresses and user-agents together so that neither one throws up red flags on its own.
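A minimal sketch of user-agent rotation with Requests; the strings below are examples, so substitute current browser user-agents in practice:

import requests
from random import choice

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]
headers = {"User-Agent": choice(user_agents)}   # pick a different one per request
response = requests.get("https://WebsiteName.com", headers=headers)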
5. Honeypots
A honeypot is a subtle trap for bots. It’s a link hidden in a site that’s invisible to anyone looking at the fully rendered webpage. The honeypot is instead placed under a CSS element with display: none. When the scraper interacts with that link, it lets the site know that it’s a bot, and the site can then block it. You can avoid honeypots by programming your scraper to avoid crawling anything in a CSS element that isn’t set to display on the page.
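A minimal, hedged sketch of that filter, checking inline styles only (honeypots hidden through external CSS files would need rendered-page checks):

from bs4 import BeautifulSoup

html = """<a href='/real'>Products</a>
<a href='/trap' style='display: none'>Hidden link</a>"""
soup = BeautifulSoup(html, "html.parser")

visible_links = []
for link in soup.find_all("a"):
    style = (link.get("style") or "").replace(" ", "")
    if "display:none" in style or "visibility:hidden" in style:
        continue   # likely a honeypot; don't follow it
    visible_links.append(link["href"])
print(visible_links)   # ['/real']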
6. iframe tags
An iframe tag is a piece of a website that’s rendered entirely from an external source. Scraping the iframe directly will only yield an error or the code that loads the iframe. You can scrape iframes by taking a two-step approach: first, request the page on which the iframe appears and collect its src attribute; then set the src attribute as the target for a scrape of its own.
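A sketch of that two-step approach with Requests and Beautiful Soup (the page URL is a placeholder):

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# step 1: load the outer page and pull the iframe's src attribute
outer = requests.get("https://WebsiteName.com/page-with-iframe")
soup = BeautifulSoup(outer.text, "html.parser")
iframe_src = urljoin(outer.url, soup.find("iframe")["src"])   # handle relative src values

# step 2: scrape the iframe's source page directly
inner = requests.get(iframe_src)
inner_soup = BeautifulSoup(inner.text, "html.parser")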
7. Pattern detection
Some sites with anti-crawling programs will look for defined patterns in visitor activity. These patterns may include the timing of clicks, the location of clicks, and more. You can avoid this by integrating auto-throttle extensions like the one offered by Scrapy to randomize the scraper’s browsing pattern and the time between each action.
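For example, Scrapy’s built-in AutoThrottle extension is switched on in a project’s settings.py. A minimal sketch of the relevant settings:

# settings.py: let Scrapy adapt its crawl speed to the server
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5      # initial delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60       # ceiling for the adaptive delay
DOWNLOAD_DELAY = 2                # baseline delay between requests
RANDOMIZE_DOWNLOAD_DELAY = True   # vary each delay by a factor of 0.5-1.5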
8. Rate-request analysis
One of the most common signs that a site visitor is a bot and not a human is an extremely high number of requests in a short timeframe. Auto-throttles can also help you slow down your web scraper just enough to look realistic.
9. Redirects
A redirect sends a site visitor to a new page automatically by returning a 3XX response code. If you’re receiving redirects, libraries like Requests (which follows redirects by default) or Scrapy’s RedirectMiddleware will help your scraper follow the redirect to the actual goal page.
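A quick sketch with Requests, which follows 3XX responses by default; response.history shows each hop along the way:

import requests

response = requests.get("http://WebsiteName.com")    # may redirect, e.g. to https
for hop in response.history:
    print(hop.status_code, hop.url)                  # each intermediate redirect
print(response.status_code, response.url)            # the final goal page

# or inspect the redirect yourself without following it
raw = requests.get("http://WebsiteName.com", allow_redirects=False)
print(raw.status_code, raw.headers.get("Location"))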
10. Unstructured HTML
The most challenging problem of all comes from unstructured HTML. There are two reasons for unstructured HTML: dynamic, server-side CSS classes and attributes, and bad programming. Either way, there’s no pattern you can program the scraper to follow. In this case, the only solution is to start using regular expressions or complex XPath queries to hack your way through the mess.
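For instance, a hedged sketch of a regular-expression fallback for pulling prices out of markup with no usable structure:

import re

html = "<div><span>$1,299.99</span><b>$49.50</b></div>"   # stand-in for messy page source
prices = re.findall(r"\$\d[\d,]*\.\d{2}", html)           # match dollar amounts
print(prices)   # ['$1,299.99', '$49.50']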
The Best Python Web Proxy for Web Scraping
Proxies can solve many of your web scraping problems. From helping prevent scraping detection to keeping your IP from being banned, a good proxy is essential. But what kind of Python web proxy is best?
Rayobyte’s rotating residential proxies are your best solution. A rotating residential proxy is an IP address that makes your traffic look like it’s coming from a personal residence. The proxy automatically rotates — or swaps in — a new Python web proxy regularly.
That has two effects:
- Most sites are hesitant to block traffic that looks like it’s coming from a real home because of the risk of blocking a human. That means residential proxies are less likely to be blocked.
- Rotating proxies mean you’re never using one proxy too long and getting it permanently banned. You can buy a collection of rotating residential proxies from Rayobyte and keep them all safe by relying on their native rotation.
That’s not the only way that Rayobyte proxies can help. Part of the joy of Python is the number of libraries it contains that handle complex tasks for you. You can extend that beyond your web scraper and integrate it into your proxies by using a program like Proxy Pilot from Rayobyte. This free Python web proxy management solution can help you take care of proxy retries, rotation logic, and cooldown logic without needing to write the systems yourself.
All Rayobyte residential proxies have Proxy Pilot built right in, which means they’re optimized for web scraping. You don’t need to install anything extra when you’re working with Rayobyte’s rotating residential proxies. You can simply let them do the work for you.
Conclusion
Now you’ve learned why Python is the best language for most people to write web scrapers with, the preliminary research you need to do to write a worthwhile program, how to write a Python web scraping program, and the most common problems in web scraping. You’re prepared to face anything your Python journey may throw at you. Take the final step and make your scraper the best it can be by implementing Rayobyte proxies.
You’ve put in all the work to make sure your scraper is the best it can be. Now, make sure that you support it with top-tier proxies that will keep your IP address safe and your scraper running. You can do all that and more with Rayobyte rotating residential proxies today.