The Ultimate Guide To Using Python & Selenium To Get Around CAPTCHA
Every web scraping job has the same goal: to gather enough data so that businesses can improve their products and services. Done right, web scraping yields the data-driven insights that companies need to stand out from the competition. But as you know, web scrapers face a lot of obstacles to getting their job done.
CAPTCHA is one of the most common challenges faced by web scrapers, and it’s made plenty of people tear their hair out in the past.
Good news, though – this article will explain exactly how to handle CAPTCHA in Selenium. We’ll tell you all of the steps involved, and we’ll lay out a few methods that cover just about every variation of CAPTCHA. We’ll also provide a tutorial on using the undetected-chromedriver to bypass CAPTCHA.
Keep reading to learn more about how to handle CAPTCHA in Selenium and Python.
What Is Captcha?
CAPTCHA (or Captcha) stands for Completely Automated Public Turing Test to Tell Computers and Humans Apart. It’s a security measure that web designers install as a way of preventing bots from accessing their computing services, spamming their websites, or collecting private information.
Captcha is designed to spot the difference between automated programs and human beings. Over the years, web designers have gotten a lot more sophisticated, and Captcha has evolved. Today, there are many different kinds of Captcha and Re-Captcha around.
Here are the most common kinds of Captcha:
- Captcha with distorted letters
- Captcha with images
- “I am not a robot” Captcha
- Hidden, or background Captcha
Let’s take a closer look at what each of these different Captchas looks like.
Captcha with distorted letters
This is the classic form of Captcha. In order to pass the test, users have to correctly identify and re-type a series of distorted letters. Humans can usually read the letters easily, but it’s a challenge for automated programs.
Captcha with distorted letters is, in fact, the only “true” Captcha. The other forms are technically called Re-Captcha. These are tests designed and distributed by Google. Re-Captcha works on the same principle as Captcha, but it uses real-world, blurry images instead of distortion.
In this article, we’ll refer to all of the Captcha and Re-Captcha as simply Captcha.
Captcha with images
A more recent form of Captcha asks users to look at a series of squares and select the squares that contain a certain type of image. Users might be asked to click on all the squares that contain a boat, or a bridge, for example.
The images are deliberately blurred. This makes it very difficult for automated programs to correctly identify them. However, it’s easy for humans to spot the images, since humans are used to looking at images under all kinds of settings, blurry or not.
The “I am not a robot” Captcha
On the surface, the “I am not a robot” Captcha simply asks users to check a box. But in reality, the test analyzes the way a user’s cursor moves on its way to the checkbox. Human users move differently from automated programs, and the test is sophisticated enough to pick up on the difference.
The test may also look at the user’s cookies and their device history as clues to determine whether it’s dealing with an automated program or a human.
Hidden Captcha
Some of the newest forms of Captcha don’t look like tests at all. They don’t ask users to decipher text, study blurry images, or click a box. Instead, these new tests work by analyzing user cookies, device history, and user activity. That’s often enough to tell whether an automated program or a human is browsing the website.
Now that we’ve established exactly what Captcha is, it’s time to talk about how to handle Captcha in Selenium and Python.
Why Is Captcha a Problem?
Captcha is designed to prevent automated programs from crawling websites and autofilling online forms. As you know, Captcha is a major challenge for web scrapers. But it’s also an obstacle for people in other circumstances, including:
- Software engineers using automated tools to test new applications
- Consumers using autofill to submit their payment information at an ecommerce point of sale
- Businesses using automated processes to speed up their workflows
Captcha can get in the way of people who just want to complete simple online processes. Fortunately, there are plenty of ways to handle Captcha in Selenium and Python. Let’s take a look at some of those methods.
Can Captcha Be Automated Using Selenium?
It’s tempting to try and find a way to automate Captcha solving using Selenium. It is, in fact, possible to do this using a combination of Optical Character Recognition (OCR) and a complex algorithm. However, this approach is very difficult to implement and is not recommended.
Instead, we recommend trying another approach to avoid Captcha in Selenium. We’ll get into some of the best anti-Captcha approaches for Python below.
How To Bypass Captcha in Python and Selenium
There are a few different approaches users can take to bypass Captcha. Here are some of the best methods.
Bypassing Captcha using Undetected ChromeDriver
This is one of the surest methods of bypassing Captcha. We’ll walk you through the steps so that you can understand how to handle Captcha in Selenium and Python. It’s a simple process.
First, though, let’s make sure we have our terms straight.
What is Undetected ChromeDriver?
Selenium uses a WebDriver (ChromeDriver, in the case of Chrome) to pass commands from your script to the browser and navigate between web pages.
However, Selenium’s WebDriver isn’t very good at handling Captcha. It’s also not great for handling other anti-scraping measures. That’s why many web scrapers are using Undetected ChromeDriver to navigate Chrome.
Undetected ChromeDriver is a patched version of ChromeDriver for Selenium, designed to avoid triggering anti-bot services. It excels at getting around Captcha. If you want to handle Captcha in Selenium, this is a great solution.
Step-by-step guide to handling Captcha with Undetected ChromeDriver
Step one: Preparation
Before you can handle Captcha in Selenium, you’ll need to make sure that you have all the necessary prerequisites for this Python Captcha solver.
You must:
- Install Selenium, if you don’t already have it installed.
- Install a recent version of Python. The driver only works with Python 3.6 or later.
- Install Chrome
Step two: Install Undetected-ChromeDriver
You’ll need to install undetected-chromedriver and the requests module using the following pip command:
pip install undetected-chromedriver requests
Step three: Import libraries for Undetected ChromeDriver
After you install undetected-chromedriver, you are ready to import it and start the browser. You can use these commands:
import undetected_chromedriver as webdriver
# Configure Chrome to run without a visible window.
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
# use_subprocess is an option of undetected-chromedriver itself, not a Chrome flag.
browser = webdriver.Chrome(options=chrome_options, use_subprocess=True)
When these commands have run, Chrome should start automatically in the background in headless mode, with no visible window.
Note that there are sometimes glitches in the installation and import process. You may need to create and activate a virtual environment before installation. You can find a tutorial on correcting some of the common errors.
Step four: Visit the website you want to scrape
Use Undetected ChromeDriver to navigate to the webpage you want to scrape data from. We’ll call it the target website. Use this command, replacing targetwebpage with the full URL of the page, including the https:// prefix:
browser.get("targetwebpage")
Step five: Screenshot the page
Get a screenshot of your target website. The screenshot serves to verify that the page loaded correctly without Captcha and that Undetected ChromeDriver has rendered the JavaScript files.
Use Selenium’s save_screenshot method:
browser.save_screenshot("screenshot.png")
Congratulations! If the screenshot shows the page content rather than a Captcha, you have successfully used Undetected ChromeDriver to bypass Captcha. You are now ready to start web scraping.
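Steps three to five can also be combined into one short script. The sketch below is a minimal example under stated assumptions, not a definitive implementation: the helper names (looks_like_captcha, fetch_with_screenshot) are our own, and the marker list is a rough heuristic based on the class names that reCAPTCHA and hCaptcha widgets normally use in a page’s HTML. It will not catch every kind of Captcha.

```python
# Class names normally used by reCAPTCHA and hCaptcha widgets.
# Assumption: this short list is illustrative, not an exhaustive detector.
CAPTCHA_MARKERS = ("g-recaptcha", "h-captcha")


def looks_like_captcha(page_source: str) -> bool:
    """Return True if the HTML contains any of the marker substrings."""
    html = page_source.lower()
    return any(marker in html for marker in CAPTCHA_MARKERS)


def fetch_with_screenshot(url: str, screenshot_path: str = "screenshot.png") -> bool:
    """Open url in headless Chrome, save a screenshot, and report
    whether the loaded page appears to contain a Captcha widget."""
    # Imported here so looks_like_captcha works without the package installed.
    import undetected_chromedriver as uc

    options = uc.ChromeOptions()
    options.add_argument("--headless")
    browser = uc.Chrome(options=options)
    try:
        browser.get(url)
        browser.save_screenshot(screenshot_path)
        return looks_like_captcha(browser.page_source)
    finally:
        browser.quit()  # always close Chrome, even if navigation fails
```

Calling fetch_with_screenshot with the full URL of your target website returns True when one of the markers shows up in the page, which is a hint that you should open the screenshot and check what actually loaded.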
Benefits of using Undetected ChromeDriver to bypass Captcha
Using Undetected ChromeDriver to handle Captcha in Selenium is generally very effective.
Unlike some other solutions, Undetected ChromeDriver doesn’t require human intervention to bypass Selenium Captcha. If you’re already familiar with Python and Selenium, then this may be a good solution for you.
People who are not experienced computer programmers may experience a steeper learning curve if they try to use Undetected ChromeDriver. However, even for someone totally new to computer languages, this project is feasible if they can spend some time getting themselves up to speed.
Disadvantages of using Undetected ChromeDriver to bypass Captcha
The big issue with Undetected ChromeDriver is that it doesn’t hide your IP address. This can lead to some problems. Here are a few of the most important ones, along with solutions where relevant.
Problem: Undetected ChromeDriver is not always effective for large-scale scraping.
If you’re carrying out a big scraping project, the web page’s automated detection tools will probably spot the high volume of server requests coming from your IP address. It’s likely that your IP will be blocked.
Solution: Use a rotating proxy. Rayobyte can help you find the right tools to make this a breeze.
Problem: Undetected ChromeDriver can’t fix a low reputation score.
Because you can’t hide your IP address, you also can’t hide your reputation score. If yours is low, you’ll probably get a Captcha. You may also be automatically blocked if you’re using a headless browser.
Solution: There’s only a partial solution to this problem. You can run Chrome with a visible window instead of in headless mode, for example by setting headless=False or leaving out the --headless argument. This solves the issue of being blocked for using a headless browser. However, it doesn’t resolve the problem of running into Captchas if you have a low reputation score.
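Both workarounds come down to how Chrome is started. Here is a hedged sketch of one way to wire them together: build_chrome_arguments and start_browser are hypothetical helper names, the proxy address in the usage note is a placeholder from a documentation-only IP range, and --headless and --proxy-server are standard Chrome command-line switches.

```python
from typing import List, Optional


def build_chrome_arguments(headless: bool = False, proxy: Optional[str] = None) -> List[str]:
    """Return the Chrome command-line switches for the chosen setup."""
    arguments = []
    if headless:
        arguments.append("--headless")
    if proxy:
        # Route all browser traffic through the given proxy server.
        arguments.append(f"--proxy-server={proxy}")
    return arguments


def start_browser(headless: bool = False, proxy: Optional[str] = None):
    """Start Undetected ChromeDriver with the switches built above."""
    # Imported here so build_chrome_arguments works without the package installed.
    import undetected_chromedriver as uc

    options = uc.ChromeOptions()
    for argument in build_chrome_arguments(headless, proxy):
        options.add_argument(argument)
    return uc.Chrome(options=options)
```

For example, start_browser(headless=False, proxy="http://203.0.113.10:8080") would open a visible Chrome window whose traffic goes through that proxy. Swap in the address your proxy provider gives you.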
Other Methods of Bypassing Captcha
It’s useful to have a few different approaches for handling Captcha in Selenium. Here are some of your options.
1. A Semi-Automated Approach To Bypassing Captcha
This approach is unusual because it uses a blend of automation and human intervention. Sometimes, that’s the best way to get around Captcha.
The goal here is to automatically freeze your web scraping script as soon as a Captcha is detected. Freezing the script for just 10 seconds or so protects you from a “race condition,” which is what would otherwise happen when an automated function encounters Captcha.
If you don’t freeze the page, the automated script will race ahead of the Captcha test, trying to solve the test before the Captcha element is fully loaded. As a result, you won’t pass the test.
If you freeze the script, you’ll give the Captcha time to load. At that point, you can just solve the Captcha and then let your script get back to work. You might even want to build in an alarm to notify you whenever a Captcha needs to be solved.
You can find more details about this approach and a step-by-step tutorial.
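The freeze-and-wait idea can be sketched in a few lines. This is only an illustration under our own assumptions: wait_for_human is a hypothetical helper, the check that decides whether a Captcha is still showing is supplied by you because it depends on the site, and the terminal bell plus a printed message stand in for the alarm. Chrome needs to run with a visible window so that a person can actually solve the test.

```python
import time
from typing import Callable


def wait_for_human(
    get_page_source: Callable[[], str],
    is_captcha: Callable[[str], bool],
    poll_seconds: float = 10.0,
    max_wait_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Pause the script while a Captcha is showing.

    Returns True once the Captcha is gone, or False if it is still
    there after max_wait_seconds.
    """
    waited = 0.0
    while is_captcha(get_page_source()):
        if waited >= max_wait_seconds:
            return False
        # "\a" rings the terminal bell as a simple alarm.
        print("\aCaptcha detected: please solve it in the browser window.")
        sleep(poll_seconds)
        waited += poll_seconds
    return True
```

With a Selenium browser object, a call might look like wait_for_human(lambda: browser.page_source, my_captcha_check), where my_captcha_check is whatever test fits your target website.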
Benefits of using a semi-automated solution to bypass Captcha
When it’s done correctly, the semi-automated solution can be a very effective way of bypassing Captcha. After all, human beings are naturally very good at solving Captcha tests. Our eyes are used to interpreting distorted text and blurry images.
If you have the coding skills to automatically freeze your script whenever it detects a Captcha, then this may be a good solution for you. Of course, you’ll also need to have the time and availability to solve Captchas whenever your script detects them.
Drawbacks of using a semi-automated solution to bypass Captcha
Using this solution requires a human being to be on hand and prepared to intervene at regular intervals. This is not a fully automated process, which means that it can’t simply run in the background.
People with time constraints, or anyone attempting to do more than one web scraping project concurrently, will not like this approach.
Fortunately, there are other options available for handling Captcha in Selenium.
2. Using Rotating IPs To Bypass Captcha
Some websites use a hidden form of Captcha to identify bot activity. If multiple requests are all coming from a single IP address, those websites determine that it’s automated activity, and they block that IP address from the site.
You can solve this problem by using rotating IP addresses for your web scraping project.
The benefits of rotating IPs
You may be able to successfully avoid certain Captchas with this approach. It’s especially effective for the kind of Captcha that operates in the background rather than presenting users with a direct test.
Disadvantages of using rotating IPs
Cost is the biggest potential downside to using rotating IPs. Although it’s possible to use free proxies, most of them are not very effective. It’s a better bet to use a custom-built proxy server that can rotate your IP address on a regular basis.
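To make the idea of rotation concrete, here is a minimal sketch that cycles through a pool of proxies with the requests module. The function names and the proxy addresses in the usage note are placeholders of our own. Many paid rotating proxies do the rotation for you behind a single address, in which case you would not need this.

```python
from itertools import cycle
from typing import Dict, Iterator, List


def proxy_rotation(proxy_urls: List[str]) -> Iterator[Dict[str, str]]:
    """Yield requests-style proxy mappings, cycling through the pool forever."""
    if not proxy_urls:
        raise ValueError("proxy_urls must contain at least one proxy")
    for url in cycle(proxy_urls):
        yield {"http": url, "https": url}


def fetch_all(urls: List[str], proxy_urls: List[str]) -> List[int]:
    """Request each URL through the next proxy in the pool; return status codes."""
    # Imported here so proxy_rotation works without requests installed.
    import requests

    rotation = proxy_rotation(proxy_urls)
    statuses = []
    for url in urls:
        response = requests.get(url, proxies=next(rotation), timeout=30)
        statuses.append(response.status_code)
    return statuses
```

For example, with a pool of ["http://203.0.113.1:8080", "http://203.0.113.2:8080"], the first request goes out through the first proxy, the second through the second, and the third through the first again.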
3. Using Machine Learning To Bypass Captcha
It’s possible to bypass Captcha – or at least, to bypass the text and image-based Captchas – by using a machine learning algorithm. This is quite a complex process, so we’ll break it into a step-by-step guide. You can also read more about the process.
Step one: Prepare a dataset
The first step is to collect a dataset to train the machine learning algorithm. Look for a dataset with a large number of Captcha images. There should be as many different images as possible so that the algorithm can learn to effectively recognize them.
Public dataset platforms, like Kaggle, host collections of Captcha images that you can use to train your machine learning model. You’ll also need to prepare the images by cropping them, resizing them, and converting them to grayscale so that your algorithm can recognize them more easily. Tools like OpenCV can help prepare images.
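In practice, a library such as OpenCV would do this preparation for you. Purely to show what two of those steps mean, here is a dependency-free sketch that treats an image as rows of (red, green, blue) pixels. The function names are our own, and the 0.299 / 0.587 / 0.114 weights are one widely used convention for turning color into brightness, assumed here for illustration.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255


def to_grayscale(image: List[List[Pixel]]) -> List[List[int]]:
    """Convert rows of RGB pixels to rows of 0-255 brightness values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in image
    ]


def crop(image: List[list], top: int, left: int, height: int, width: int) -> List[list]:
    """Keep only the rectangle starting at (top, left) with the given size."""
    return [row[left:left + width] for row in image[top:top + height]]
```

Grayscale drops color information the model rarely needs for reading characters, and cropping removes borders so that every training image shows just the Captcha itself.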
Once you’ve collected and prepared your images, it’s time to train the machine learning model.
Step two: Train the machine learning model
There are a few different methods possible for training a machine learning model to solve Captchas. Two of the leading options are Convolutional Neural Networks, or CNNs, and Recurrent Neural Networks, or RNNs.
Choose your method based on the type of Captcha you’re likely to encounter. CNNs excel at image recognition, which makes them a great choice for image-based Captchas. RNNs are well-suited to processing sequential data, which makes them a great fit for handling the more unusual audio-based Captcha.
Whichever method you choose, you can start to feed your images into the machine learning model. The model will “study” the images for patterns and build up an understanding of what Captchas can look like. The more images you feed the algorithm, the more reliably it will be able to recognize and correctly solve a Captcha.
Note that since Captcha images can vary quite a bit, it’s a good idea to use data augmentation – rotating, scaling, and flipping the images that you feed into the algorithm. This helps your machine learning model get a broader understanding of Captchas and “learn” how to decipher the text or images in a Captcha.
Step three: Test the machine learning model
The final step is testing the model to see how effective it is. Can it recognize and solve Captchas?
Carry out your tests using images that were not used in training the algorithm. After all, it’s already learned how to spot and solve those Captcha tests, and you want to check whether it can solve new ones.
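One simple way to guarantee that test images are never seen during training is to set some aside before training starts. The sketch below assumes your dataset is a list of image file names; split_holdout is a hypothetical helper, and the 20% test share is just a common starting point.

```python
import random
from typing import List, Sequence, Tuple


def split_holdout(
    items: Sequence[str], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle the items and split them into (training, test) lists."""
    shuffled = list(items)
    # A fixed seed makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]
```

Train only on the first list and keep the second one untouched until you are ready to measure how well the model performs.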
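You can assess the success of the training with a few different metrics. The F1 score is a popular one: it combines precision (how many of the model’s positive predictions were correct) and recall (how many of the actual positives the model found) into a single number between 0 and 1.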
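For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from three counts for a single class, such as one particular letter in a text Captcha. How you count the hits and misses for your own model is up to you, and the function name is our own.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the F1 score (harmonic mean of precision and recall)."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted  # correct share of the model's positive calls
    recall = true_positives / actual        # share of real positives the model found
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A score close to 1 means the model rarely misses the class and rarely reports it by mistake. A score close to 0 means it is doing badly on at least one of the two.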
Benefits of using machine learning to bypass Captcha
Machine learning models can recognize and solve both text and image-based Captchas if they are trained on a large enough data set. This process, while long and cumbersome, can yield good results.
For aspiring data scientists, or for people who already have a good working knowledge of machine learning methodology, this can also be a good chance to practice their skills. People with an interest in machine learning may also see this process as a chance to learn more.
Disadvantages of using machine learning to bypass Captcha
Although machine learning can be effective, it is not a complete solution for solving Captchas. The computer engineers who make Captchas are also using machine learning to develop Captchas that are better at fooling machine learning algorithms.
Besides, machine learning algorithms are probably beyond the reach of many web scrapers. Unless you have a data scientist on your team, it’s probably not time-efficient to train machine learning algorithms in order to bypass Captcha.
In Brief: What Are the Methods for Bypassing Captcha?
There are a number of different ways to handle Captcha in Selenium. To recap, they are:
- Using Undetected ChromeDriver
- Using a semi-automated approach (including some human intervention)
- Using rotating IPs
- Using machine learning
As we have seen, each of these approaches has its pros and cons. But all of these methods require users to be familiar with coding. In some cases, users would also need to use proxies in order to make the approach work.
So, what’s the best option for users who don’t have coding experience?
Working With Rayobyte To Bypass Captcha
Now that you know how to handle Captcha in Selenium, you’re well on your way to gathering the data you need to boost your productivity and understand your customers’ preferences better. But you may not have the time to dedicate to programming and coding to implement some of these solutions.
That’s why we built Rayobyte.
Rayobyte exists because we believe that proxy users deserve a company that will deliver custom-built solutions that work. We want our customers to receive top-of-the-line products that drive better outcomes and user experiences.
Proxies are a proven solution that help web scrapers get around Captcha and other common obstacles to web scraping. They let companies of any size take advantage of the benefits of big data. They also offer individuals the ability to safely and privately navigate the internet. We believe that proxies have the potential to improve businesses, by helping them to deliver better products and services. We also believe that proxies can improve individual browsing experiences by creating an extra layer of safety and privacy.
At Rayobyte, we are dedicated to guiding our customers through every step of implementing proxies so that they can complete their web scraping projects successfully. We also offer a wealth of useful information and experience that you can lean on to improve every aspect of your online experience.
Rayobyte’s Web Scraping API also offers pre-built APIs for a variety of use cases, from social media data to SERP pages. They’re highly intuitive, easy to learn, and come with 24-hour customer support.
Contact us today to learn more about the multiple scraping tools on offer and our custom solutions. Our developers can work with you to get something built that meets your needs.