The Ultimate Guide to Using a Proxy with Selenium in Python

Selenium, an open-source library, lets you simulate user interactions, making it a popular choice among web scrapers. With it, you can navigate through web pages, fill out forms, and extract data in a manner that mimics human behavior. This is particularly useful when dealing with websites that rely heavily on JavaScript for content rendering.

However, web scraping can often feel like a game of whack-a-mole: websites continuously evolve to detect and block scrapers, and proxies are one of the most effective ways to stay ahead. A Selenium proxy acts as a middleman between your scraping script and the target website, allowing you to route your requests through different IP addresses. When scraping data from websites, especially in large volumes, using a single IP address can raise red flags, leading to blocks or CAPTCHAs. Proxies mitigate this issue by distributing the load across multiple IPs, allowing your scraper to work undetected.

Integrating a Selenium proxy gives you the best of both worlds: you can scrape dynamic content at scale with the dexterity of Selenium, without the looming threat of IP bans, thanks to the anonymity a proxy provides. If you're wondering how to use a proxy with Selenium in Python, this guide will walk you through the process.

 

Try Our Residential Proxies Today!

 

How to Set Up a Chrome Selenium Proxy


The following tutorial will guide you through Selenium proxy settings in Chrome.

Install Selenium

  1. Open a terminal: Open your terminal or command prompt.
  2. Install Selenium: Run pip install selenium. This uses Python's package manager (pip) to install Selenium.
  3. Verify the installation: Open a Python interpreter (python in the command line) and type import selenium. If you don't get any errors, Selenium was installed successfully (see the example below).
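For example, here is a quick one-liner to confirm both the install and the installed version from the command line (assuming pip installed Selenium into the currently active Python environment):

python -c "import selenium; print(selenium.__version__)"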

Download ChromeDriver

ChromeDriver is a separate component that Selenium uses to control Chrome. Go to the ChromeDriver download page and download the version that matches the version of Chrome you have installed. Extract the executable from the zip file.

Create Python script

Create a new Python script file: scrape.py

Import Necessary Modules: At the top of your script, import the necessary modules.

from selenium import webdriver

from selenium.webdriver.common.keys import Keys

Set up Chrome options

Before launching the browser, you need to set up the options with which Chrome will be started. One of these options is to tell Chrome to use a proxy server.

chrome_options = webdriver.ChromeOptions()

chrome_options.add_argument('--proxy-server=http://proxy_ip:proxy_port')

Replace proxy_ip and proxy_port with the IP address and port number of your Selenium proxy server. If you're using a different type of proxy, such as SOCKS, change the http scheme in the URL to socks5.
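For example, a SOCKS5 proxy would be passed like this (proxy_ip and proxy_port are placeholders for your own proxy's address):

chrome_options.add_argument('--proxy-server=socks5://proxy_ip:proxy_port')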

Launch Chrome with Selenium proxy

Now, launch Chrome with the options you just set up. You need to specify the path to ChromeDriver that you downloaded earlier.

driver = webdriver.Chrome(executable_path='/path/to/chromedriver', options=chrome_options)
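Note that recent Selenium 4 releases removed the executable_path argument. If the line above raises a TypeError, here is a minimal sketch of the Selenium 4 equivalent using a Service object (the driver path is a placeholder; if you omit it entirely, Selenium Manager will try to locate a matching driver for you):

from selenium.webdriver.chrome.service import Service

service = Service('/path/to/chromedriver')
driver = webdriver.Chrome(service=service, options=chrome_options)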

Find the data you need

Now that Chrome is open, you can control it through your script. For example, navigate to a web page:

driver.get('https://www.example.com')

You can use various methods to interact with the page, like finding elements by their tags, attributes, CSS selectors, and more. For example, to find a search bar and enter text:

search_bar = driver.find_element_by_name('q')  # in Selenium 4, use driver.find_element(By.NAME, 'q')
search_bar.send_keys('web scraping with selenium')
search_bar.send_keys(Keys.RETURN)

Handle page loading

Sometimes you need to wait for elements on the page to load. Use WebDriverWait for this.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

timeout = 10  # seconds to wait before giving up

try:
    element_present = EC.presence_of_element_located((By.ID, 'element_id'))
    WebDriverWait(driver, timeout).until(element_present)
except TimeoutException:
    print("Timed out waiting for page to load")

How to Add Proxy Selenium Python in Firefox


Setting up a Selenium proxy with Firefox is similar but involves configuring the Firefox WebDriver (GeckoDriver) and specifying the proxy settings.

Download GeckoDriver

Firefox uses a driver called GeckoDriver. Go to the GeckoDriver download page and download the version corresponding to your Firefox browser. Extract the executable.

Create Python script and import modules

Create a new Python script file: firefox_proxy_scrape.py

Import Necessary Modules: At the top of your script, import the necessary modules.

from selenium import webdriver

from selenium.webdriver.common.proxy import Proxy, ProxyType

from selenium.webdriver.firefox.options import Options

Set up Firefox Selenium proxy settings

Define the proxy address and set it in the Firefox capabilities for HTTP, FTP, and SSL traffic.

my_proxy = "proxy_ip:proxy_port"

firefox_capabilities = webdriver.DesiredCapabilities.FIREFOX
firefox_capabilities['proxy'] = {
    "proxyType": "MANUAL",
    "httpProxy": my_proxy,
    "ftpProxy": my_proxy,
    "sslProxy": my_proxy
}

Replace proxy_ip and proxy_port with the IP address and port number of your Selenium proxy server.

Configure additional Firefox options

You can also configure additional options for Firefox using Options. For example, if you want Firefox to run in headless mode (without GUI), you can add:

firefox_options = Options()
firefox_options.add_argument("--headless")  # older Selenium releases used firefox_options.headless = True

Launch Firefox with Selenium proxy

Now, launch Firefox with the capabilities and options you just set up. Specify the path to GeckoDriver.

driver = webdriver.Firefox(executable_path='/path/to/geckodriver', capabilities=firefox_capabilities, options=firefox_options)
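As with Chrome, recent Selenium 4 releases dropped the executable_path and capabilities arguments. Here is a minimal sketch of a Selenium 4 equivalent (the driver path and proxy address are placeholders) that attaches the proxy settings to the Options object directly, using the Proxy and ProxyType classes imported earlier:

from selenium.webdriver.firefox.service import Service

my_proxy = "proxy_ip:proxy_port"

# Build a Proxy object and point HTTP and SSL traffic at the proxy
proxy = Proxy()
proxy.proxy_type = ProxyType.MANUAL
proxy.http_proxy = my_proxy
proxy.ssl_proxy = my_proxy

firefox_options = Options()
firefox_options.proxy = proxy  # attach the proxy settings to the options

service = Service('/path/to/geckodriver')
driver = webdriver.Firefox(service=service, options=firefox_options)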

Find the data you need

With Firefox open, you can control it through your script. For example, navigate to a web page:

driver.get('https://www.example.com')

Use various methods to interact with the page, like finding elements by tags, attributes, CSS selectors, and more. For example, to click a button:

button = driver.find_element_by_id('myButton')  # in Selenium 4, use driver.find_element(By.ID, 'myButton')
button.click()

Handle page loading and delays

Sometimes you need to wait for elements on the page to load. You can use WebDriverWait for this.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "myDynamicElement"))
)

Best Practices for Using a Selenium 4 Proxy


You’ll find more success in web scraping with a Selenium proxy if you engage in the following best practices:

Rotate your Selenium proxy

Rotating a Selenium proxy involves cycling through a list of proxy servers and using a different Selenium proxy for each session or request. Here's how you can change the proxy in Selenium Python:

  • Gather your Selenium proxy list: Compile a list of Selenium proxy servers that you want to use. This list should contain the IP address and port of each proxy server.
  • Initialize WebDriver with the proxy: For each Selenium proxy in your list, initialize a new WebDriver instance and configure it to use the proxy server.
  • Perform actions and close the WebDriver: Use the WebDriver to navigate to web pages and perform actions. Once you’re done, close the WebDriver.
  • Repeat with the next Selenium proxy: Loop through your list of Selenium proxy servers, initializing a new WebDriver with the next Selenium proxy in the list each time.

Here’s an example code that demonstrates this process:

from selenium import webdriver

# List of proxies
proxies = [
    'http://proxy1:port',
    'http://proxy2:port',
    # ...
]

# List of URLs to scrape
urls = [
    'http://example1.com',
    'http://example2.com',
    # ...
]

# Loop through each proxy and URL
for proxy, url in zip(proxies, urls):
    # Initialize Chrome options
    chrome_options = webdriver.ChromeOptions()

    # Set the proxy
    chrome_options.add_argument(f'--proxy-server={proxy}')

    # Start the WebDriver with the proxy options
    driver = webdriver.Chrome(options=chrome_options)

    # Navigate to the web page
    driver.get(url)

    # Perform your scraping or interaction here
    # ...

    # Close the WebDriver
    driver.quit()

Frequently opening and closing WebDriver instances can be resource-intensive. If you have a large number of proxies and URLs, you might want to implement some delay or use a more efficient approach like browser pooling.

Set user-agents

A user-agent is a string that web browsers send to servers as part of the HTTP request headers. This string tells the server information about the browser and the operating system from which the request is made. Websites sometimes use this information to serve different content depending on the browser or device or to block or limit requests from non-browser clients like web scrapers.

When web scraping, it’s common for servers to block or throttle requests with suspicious or non-standard user-agent strings. To bypass these limitations and mimic a real browser, you can set a custom user-agent string in your requests.

You can also change the user-agent string for each request. This can make your scraping bot look like different browsers or devices, making it harder for websites to detect and block it.
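As a minimal sketch, you can set a custom user-agent through Chrome options (the user-agent string below is just an illustrative value; substitute whatever browser signature you want to present, or rotate it per session):

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()

# Example user-agent string; rotating this value makes requests look like different browsers
custom_ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
chrome_options.add_argument(f'--user-agent={custom_ua}')

driver = webdriver.Chrome(options=chrome_options)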

Handle CAPTCHAs

Handling CAPTCHAs can be quite challenging because CAPTCHAs are specifically designed to prevent automation. The best way to handle CAPTCHAs is to avoid triggering them in the first place. This can be achieved through:

  • Respecting rate limits: Avoid making too many requests in a short period. Implement delays between your requests.
  • Using realistic user agents: Mimic real browsers by setting user-agent strings.
  • Rotating IP addresses: Use a rotating Selenium proxy to avoid IP-based rate limiting.

Follow Robots.txt guidelines

Robots.txt is a file that webmasters use to instruct web crawlers and scraping bots on which parts of the site should not be crawled or scraped. Respecting robots.txt is considered good practice and part of being a responsible and ethical web scraper.

Understanding robots.txt

A robots.txt file is usually located at the root of a website. The file contains directives that are meant to communicate the webmaster’s wishes to bots. The basic directives are:

  • User-agent: Specifies the web crawlers or user agents the rule applies to.
  • Disallow: Tells the user agent not to crawl the specified URL path.
  • Allow: Tells the user agent that it can crawl the specified URL path (this is used to make exceptions to Disallow rules).

Example of a robots.txt file:

User-agent: *
Disallow: /private/
Disallow: /restricted/

User-agent: Googlebot
Allow: /special-content/

In the example above, all bots are disallowed from crawling URLs under /private/ and /restricted/, but Googlebot is specifically allowed to crawl URLs under /special-content/.

Respecting robots.txt in Web Scraping

  • Check for robots.txt: Before scraping a website, always check if there is a robots.txt file by appending /robots.txt to the site’s base URL.
  • Parse and understand the rules: Analyze the directives in the robots.txt file to understand which parts of the site you're allowed to scrape. Libraries such as robotexclusionrulesparser, or Python's built-in urllib.robotparser, can help you parse these rules programmatically (see the sketch after this list).
  • Follow the rules: Respect the directives in the robots.txt file. Do not scrape pages or directories that have been disallowed.
  • Be mindful of Crawl-Delay: Sometimes, the robots.txt file might have a Crawl-delay directive. This tells you the number of seconds you should wait between successive requests. Respecting this can help prevent overloading the server.
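Here is a minimal sketch using Python's built-in urllib.robotparser module (the site URL and the MyScraperBot user-agent are placeholder values):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# Check whether a specific URL may be fetched by your user agent
if rp.can_fetch("MyScraperBot", "https://www.example.com/private/page.html"):
    print("Allowed to scrape this URL")
else:
    print("Disallowed by robots.txt")

# Respect a Crawl-delay directive if one is declared
delay = rp.crawl_delay("MyScraperBot")
if delay:
    print(f"Site requests {delay} seconds between requests")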

Legal and ethical considerations

While robots.txt is a standard that webmasters use to communicate their preferences, not all bots respect it. However, as a responsible web scraper, it’s good practice to adhere to these rules. In some places, not respecting the rules in robots.txt might have legal implications, especially if scraping is done in a way that causes harm or accesses sensitive data.

Always be mindful of the ethical and legal aspects of web scraping, and strive to minimize the impact of your scraping activities on the websites you scrape.

Implement delays and throttling

When using Selenium for web scraping, implementing delays and throttling will help mimic human-like interactions and avoid overwhelming the target website.

Implicit waits

Implicit waits tell WebDriver to poll the DOM for a certain amount of time when trying to find elements. This is useful when specific elements on the webpage are not immediately available and need time to load.

from selenium import webdriver

driver = webdriver.Chrome()
driver.implicitly_wait(10)  # Wait up to 10 seconds for elements to be available

# Navigate to URL
driver.get("http://example.com")

# WebDriver will now wait up to 10 seconds before throwing a NoSuchElementException
element = driver.find_element_by_id("myDynamicElement")

Explicit waits

Explicit waits are used when you need to wait for a specific condition to occur before proceeding. This is more efficient than implicit waits as it waits only as long as necessary.

Example:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()

# Navigate to URL
driver.get("http://example.com")

# Wait for the element to be present before interacting with it
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "myDynamicElement"))
)

Fixed delays with time.sleep()

You can use Python’s time.sleep() function to pause the execution of the script for a fixed number of seconds. This can be used to mimic human behavior by adding pauses between actions.

Example:

import time
from selenium import webdriver

driver = webdriver.Chrome()

# Navigate to URL
driver.get("http://example.com")

# Wait for 5 seconds
time.sleep(5)

# Perform some action
driver.find_element_by_id("someElement").click()

Randomized delays

To make your scraping bot less predictable, you can introduce random delays between actions using Python's random module.

Example:

import time
import random
from selenium import webdriver

driver = webdriver.Chrome()

# Navigate to URL
driver.get("http://example.com")

# Wait for a random number of seconds between 2 and 5
time.sleep(random.uniform(2, 5))

# Perform some action
driver.find_element_by_id("someElement").click()

Throttling the network through Chrome DevTools

When using the Chrome WebDriver, you can simulate different network conditions through the Chrome DevTools Protocol (CDP) by enabling network tracking and setting a throttling profile.

Example:

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_experimental_option("prefs", {
    "profile.default_content_setting_values.notifications": 2
})

driver = webdriver.Chrome(options=chrome_options)

# Enable network tracking, then emulate a slow connection
driver.execute_cdp_cmd("Network.enable", {})
driver.execute_cdp_cmd("Network.emulateNetworkConditions", {
    'offline': False,
    'downloadThroughput': 500 * 1024 / 8,  # 500 kb/s
    'uploadThroughput': 500 * 1024 / 8,    # 500 kb/s
    'latency': 50                          # 50 ms
})

# Navigate to URL
driver.get("http://example.com")

Implement error handling and logging

Error handling and logging ensure the stability of your script and record the events or problems that occur during execution.

Error handling

When your script interacts with web elements, there may be cases where elements are not found, pages fail to load, or other exceptions occur. To handle these exceptions gracefully, you can use Python’s try-except blocks.

Example:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()

try:
    # Try to find an element
    element = driver.find_element_by_id("nonexistent_id")
except NoSuchElementException:
    # Handle the exception if the element is not found
    print("Element not found")

For different types of exceptions, you can have different except blocks. For instance, TimeoutException can be caught separately to handle cases where an element takes too long to load.
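For instance, here is a minimal sketch of catching the two exception types separately (the element ID, URL, and 10-second timeout are placeholder values):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, TimeoutException

driver = webdriver.Chrome()
driver.get("http://example.com")

try:
    # Wait up to 10 seconds for the element to appear, then interact with it
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "slow_element"))
    )
    driver.find_element(By.ID, "slow_element").click()
except TimeoutException:
    print("Timed out waiting for the element to appear")
except NoSuchElementException:
    print("Element not found on the page")
finally:
    driver.quit()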

Logging

Logging is essential for monitoring and debugging. Python’s built-in logging module lets you log messages to a file or console.

Example:

import logging
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

# Configure logging
logging.basicConfig(filename='example.log', level=logging.INFO,
                    format='%(asctime)s:%(levelname)s:%(message)s')

# Initialize WebDriver
driver = webdriver.Chrome()

try:
    # Try to navigate to a URL
    driver.get("http://example.com")
    logging.info("Navigated to http://example.com")

    # Try to find an element
    element = driver.find_element_by_id("nonexistent_id")
except NoSuchElementException:
    # Log the exception
    logging.error("Element with ID nonexistent_id not found")

# Close the WebDriver
driver.quit()

This code configures logging to write messages to a file named example.log. It logs info messages when navigation is successful and error messages when an element is not found.

Combining error handling and logging

You can combine error handling and logging to create robust scripts. For example, when you catch an exception, you can log the details of the exception to a file. This will help you understand what went wrong when reviewing the logs later.

import logging
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

# Configure logging
logging.basicConfig(filename='example.log', level=logging.INFO,
                    format='%(asctime)s:%(levelname)s:%(message)s')

# Initialize WebDriver
driver = webdriver.Chrome()

try:
    # Try to find an element
    element = driver.find_element_by_id("nonexistent_id")
except NoSuchElementException as e:
    # Log the exception
    logging.error(f"Exception occurred: {e}")

# Close the WebDriver
driver.quit()

When Do You Need To Use a Selenium Proxy Server?


Selenium isn't always necessary: when a site can be scraped without a full browser, simpler HTTP-based tools save time, bandwidth, and server resources. Let's discuss scenarios where using Selenium is necessary and where it might be overkill.

When to use Selenium

  • Dynamic content: When the website you are scraping relies heavily on JavaScript to load content, Selenium can be a good choice as it allows you to interact with the web page in a way that mimics a real user using a browser, ensuring that all the dynamic content is loaded.
  • Page interactions: If you need to perform actions like clicking buttons, filling and submitting forms, or scrolling through pages to scrape the data you need, Selenium can be really helpful since it can automate these interactions.
  • Dealing with IFrames: Some websites use IFrames to embed content from another source, and this content may not be easily accessible using traditional HTTP requests. Selenium can switch between frames and scrape content from within IFrames.
  • Captchas and login pages: In some cases, you might need to handle captchas or login through authentication pages to access the content. Although handling captchas is tricky, sometimes using a browser automation tool like Selenium can make this easier compared to using raw HTTP requests.

When not to use Selenium

  • Static websites: If the website is static (meaning the page content is delivered as plain HTML without client-side scripting like JavaScript), a plain HTTP request library such as requests, coupled with a parsing library like BeautifulSoup, can provide faster and more efficient results (see the sketch after this list).
  • Large-scale scraping: If you’re scraping a large volume of data, Selenium might not be the best choice due to its resource-intensive nature. In such cases, you might want to look into more lightweight solutions.
  • API endpoints: If the website loads data from an API, it might be more efficient to directly make HTTP requests to this API to fetch the data in a structured format like JSON rather than rendering the entire webpage with Selenium.
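For a static page, a minimal sketch with the requests and BeautifulSoup libraries might look like this (the proxy address, URL, and h2 tag are placeholder values; you can route the request through the same proxy you use with Selenium):

import requests
from bs4 import BeautifulSoup

# Fetch the page with a plain HTTP request, optionally through a proxy
proxies = {"http": "http://proxy_ip:proxy_port", "https": "http://proxy_ip:proxy_port"}
response = requests.get("https://www.example.com", proxies=proxies, timeout=10)

# Parse the static HTML and extract data
soup = BeautifulSoup(response.text, "html.parser")
for heading in soup.find_all("h2"):
    print(heading.get_text(strip=True))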

 

Try Our Residential Proxies Today!

 

Final thoughts


Although it can seem like a complex topic, learning how to use a proxy with Selenium in Python can significantly enhance your web scraping projects. By circumventing IP-based restrictions and accessing content from different geolocations, a Selenium proxy lets you scrape data more efficiently and anonymously. Now that you understand how to set up the environment, configure Selenium to use proxies, and integrate that configuration into your Python code, you can supercharge your scraping efforts.

However, your web scraper will only be as good as your Selenium proxy. For the best proxies, contact Rayobyte today. With internationally located proxies, Rayobyte lets you access content from around the globe at lightning-fast speeds. You can effortlessly switch between HTTP, HTTPS, and SOCKS protocols, catering to various web scraping requirements.

Our top-notch security protocols give you peace of mind when scraping, and our high ethical standards ensure you never have to worry about being affiliated with shady practices or malicious actors — an increasingly important concern in our data-driven world. Whether your focus is on eCommerce data extraction, SEO monitoring, or scraping social media, Rayobyte has the perfect Selenium proxy.

We believe you deserve a proxy provider who wants to be your partner. Our experts are happy to help you crawl, scrape, and scale whatever you need! Reach out today to learn more.

The information contained within this article, including information posted by official staff, guest-submitted material, message board postings, or other third-party material is presented solely for the purposes of education and furtherance of the knowledge of the reader. All trademarks used in this publication are hereby acknowledged as the property of their respective owners.
