The Most Common User Agents and Why They Matter for Web Scraping
When you type a search query in your browser, you may be unaware of a lot of action that’s happening in the background. One of these things is a user agent, which your browser sends to every website you connect with.
In its simplest form, a user agent is a line of text or string that identifies the browser to the web server. While that may sound simple, understanding how user agents function can be a bit complicated. Whenever a browser connects to a website, it has a field for user agent in the HTTP header. The content of this field is different for each browser. Thus, every browser has a distinct user agent.
Basically, a user agent is a way for your browser to introduce itself to the web server. Think of it as a web browser saying “Hi, I am a web browser” to the web server. The web server uses this information to serve pages suited to different operating systems and web browsers.
This guide goes into detail about user agents and their types and discusses the most common user agents and the role they play in web scraping.
What Is a User Agent?
A user agent is software that retrieves, renders, and facilitates web content for end users. Examples include media players, plug-ins, and web browsers. The user agent family also includes consumer electronics, stand-alone applications, and operating system shells.
Not every piece of software qualifies as a user agent. According to Wiki, it must meet certain conditions.
Software is a primary user agent if:
- It is a stand-alone application.
- It interprets a W3C language.
- It interprets a declarative or procedural language that is used for the provision of a user interface.
Software is a user agent extension if:
- It expands the functionality of a primary user agent or is launched by one.
Meanwhile, the software is a web-based user agent if:
- The declarative or procedural language is interpreted to generate the user interface.
- The interpretation was done by a user agent extension or a primary user agent.
- The user interaction does not modify the Document Object Model (DOM) of the containing document.
What is a user agent in a browser?
As discussed earlier, there is a user agent field in the HTTP header when the browser connects to any website. Each browser has different content in this field. Essentially, this content introduces the browser to the web server.
The web server can further use this information for specific tasks. For instance, a website can use this information to transmit mobile pages to mobile browsers. It may also send an “upgrade” message to an older version of Internet Explorer.
Let us take a look at the most common user agents in browsers and what the information means. Here is the user agent of Firefox on Windows 7:
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0
If you take a look at this user agent, you will notice that it gives a lot of information to the web server. It shows the code name Windows NT 6.1, which indicates Windows 7 is the operating system.
It also shows the token WOW64, which indicates that a 32-bit browser is running on a 64-bit version of Windows. Finally, it shows that the browser is Firefox 12.
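As a rough illustration of how these details can be pulled out of the string, the sketch below splits the Firefox user agent above into its name/version tokens and the items inside the parentheses. The function name `split_user_agent` is hypothetical, and production-grade parsers handle many more edge cases than this.

```python
import re

FIREFOX_UA = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0"


def split_user_agent(ua):
    """Return (products, platform_details) for a user-agent string.

    products: dict mapping names to versions for tokens like "Firefox/12.0".
    platform_details: the semicolon-separated items inside the first
    pair of parentheses.
    """
    comment = re.search(r"\(([^)]*)\)", ua)
    details = [part.strip() for part in comment.group(1).split(";")] if comment else []
    # Drop parenthesized comments before looking for name/version tokens.
    without_comments = re.sub(r"\([^)]*\)", " ", ua)
    products = dict(re.findall(r"(\S+)/(\S+)", without_comments))
    return products, details


products, details = split_user_agent(FIREFOX_UA)
```

Here `details` holds the operating system and architecture tokens, while `products` holds the browser and engine names with their versions.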
Now that you have some basic understanding of a user agent, here’s an example of Internet Explorer 9:
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)
While the rest of the information is understandable, the confusing part here is that the user agent is identified as Mozilla. To understand this fully, you also have to look at the user agent for Chrome:
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.52 Safari/536.5
This may be even more confusing as Chrome is showing itself to be both Safari and Mozilla. Why is that? Taking a deep dive into the history of browsers and user agents will help you understand this.
String mess of user agents
One of the first-ever browsers to be used was Mosaic. The user agent of Mosaic was simply NCSA_Mosaic/2.0. Later, when Mozilla was introduced, it had the user agent: Mozilla/1.0.
Since Mozilla supported frames, it was more advanced than Mosaic as a browser. So when web servers received user agents, they sent pages with frames to the ones containing the word “Mozilla.”
As for others, the web server sent them older pages that did not have frames. With time, Internet Explorer was introduced by Microsoft. Since it was a modern browser, it also supported frames. However, Internet Explorer did not get web pages containing frames because the web servers would send all those to Mozilla. Microsoft fixed this problem by adding “Mozilla” to the Internet Explorer user agent.
They also added some extra information, such as an Internet Explorer reference and the word “compatible.” When web servers saw the word “Mozilla” in the user agent, they sent pages with frames to Internet Explorer too. As other browsers such as Chrome and Safari came along, they took the same approach. Therefore, you will see the names of other browsers in the user agent of an individual browser.
Some web servers also started looking for the word “Gecko” in the user agent, which is the rendering engine of Firefox. Web servers would send different pages to Gecko browsers as compared to older browsers. KHTML began to add “like Gecko” and similar words to its user agents to get the modern pages with frames from the web servers. Eventually, WebKit was introduced. Since it was KHTML-based, it also contained “KHTML, like Gecko” and “WebKit.”
The developers did this to maintain compatibility. Thus, browser developers added more words to the user agents to make them compatible with the standard and ensure that they get modern pages from the web server. That is why user agents today are much longer than those in the past. The bottom line is that web servers do not care about the exact string of user agents — they just look for certain words.
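The keyword matching described in this history can be sketched in a few lines. This is only an illustration of the idea, not how any particular web server is implemented. The function names and the keyword list are assumptions made for the example.

```python
CHROME_UA = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.5 "
    "(KHTML, like Gecko) Chrome/19.0.1084.52 Safari/536.5"
)


def matched_keywords(ua, keywords=("Mozilla", "Gecko", "KHTML", "WebKit", "Safari", "Chrome")):
    """Return the keywords that appear anywhere in the user-agent string."""
    return [word for word in keywords if word in ua]


def naive_supports_frames(ua):
    """Hypothetical legacy check: any user agent containing "Mozilla" gets the modern page."""
    return "Mozilla" in ua
```

Because the check only looks for substrings, the Chrome string satisfies every keyword in the list, while the original Mosaic string matches none of them.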
List of most common user agents
Below, we have listed the most common user agents. If you want to emulate a different browser, you can use one of these strings instead of a user agent switcher.
- Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36
- Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:53.0) Gecko/20100101 Firefox/53.0
- Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)
- Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0; MDDCJS)
- Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393
- Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
Why Are User Agents Important?
User agents are essential because they set browsers apart from each other. Once a web server identifies a user agent, content negotiation starts. The process refers to the mechanism in HTTP that lets you provide different versions of a resource through the same URL.
In simple words, when you enter a URL, the web server will check the user agent and show you the appropriate web page. You do not have to type a different URL when you access a website from your mobile device. The exact URL shows you different versions of a web page on different devices.
A common application of content negotiation is choosing image formats. Suppose an image is provided in both PNG and GIF formats. The GIF version is shown to users on older MS Internet Explorer versions that cannot display PNG images, while the PNG version is shown on modern browsers. Likewise, the web server can serve different scripts and stylesheets, such as JavaScript and CSS files, depending on the capability of the browser you are using. Moreover, if the request also carries information about the language settings, the appropriate language version will be displayed.
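A minimal sketch of the image example, assuming the server decides purely from the user-agent string. The function name `pick_image_format` and the version threshold are illustrative assumptions. The text above does not say which Internet Explorer versions count as “older.”

```python
import re


def pick_image_format(ua, oldest_png_capable_ie=7):
    """Return "gif" for old Internet Explorer user agents, otherwise "png".

    oldest_png_capable_ie is an arbitrary threshold chosen for the
    example, not a documented fact about Internet Explorer.
    """
    match = re.search(r"MSIE (\d+)", ua)
    if match and int(match.group(1)) < oldest_png_capable_ie:
        return "gif"
    return "png"
```

The same URL can then return either file, depending on which browser asked for it.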
For example, a media player lets you play videos, and a PDF reader allows access to PDF documents. A PDF reader, however, will not open MS Word files since it does not recognize the information.
Agent name delivery
Agent name delivery refers to the process in which an application gets content that is tailored according to the user agent. Search engine optimization (SEO) leverages this process to show different content to the user agents of real visitors and bots.
In this process, called cloaking, regular visitors see a version of the web page that is optimized for human use. Meanwhile, crawlers see a website’s structure and content optimized for simplicity and high rankings in the search engine’s results.
User agent switching
When browsing and scraping the web, there are different reasons you may want to change your user agent. This is called user agent switching. We’ll dive into more of those specifics later.
Types of User Agents
Browsers are a typical example of user agents, but many other applications may act as user agents.
Different user agents include:
- Crawlers
- SEO tools
- Link checkers
- Legacy operating systems
- Game consoles
- Web applications, such as PDF readers, media players, and streaming portals
Humans do not necessarily control all user agents. Some user agents run automatically as programs. One such example is search engine crawlers.
User Agents Use Cases
Web servers use user agents for different purposes. Some of them include:
Serving web pages: Looking at the information in the user agent, a web server determines which web page it has to serve to a web browser. Some web pages are served to older browsers while others are served to modern ones. For instance, if you have ever seen a message along the lines of “This page must be viewed in Internet Explorer,” that is due to the difference in the user agent.
Serving operating systems: Web servers also use user agents to show different content to each operating system. For instance, when you see the same web page on a mobile phone screen and your laptop, they appear different. One of the reasons for this is the user agent. If the web server receives the request from a mobile device, this will be specified in the user agent. The web server will then display a slimmed-down page that fits on the mobile device’s screen.
Statistical analysis: Web servers also use user agents to gather statistics about their users’ operating systems and browsers. Have you ever seen statistics showing that Chrome is used more than Safari or a certain percentage of people now access the web using their mobile devices? This is how that information is acquired.
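The statistical use case can be sketched as a simple tally over logged user-agent strings. The sample log and the helper names below are made up for illustration, and the substring checks are a simplification of what analytics tools actually do.

```python
from collections import Counter

CHROME = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
          "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")
FIREFOX = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:53.0) Gecko/20100101 Firefox/53.0"
EDGE = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393")

# Hypothetical sample of logged requests.
SAMPLE_LOG = [CHROME, CHROME, FIREFOX, EDGE]


def browser_family(ua):
    """Guess the browser family from a user-agent string."""
    # Order matters: Edge strings also contain "Chrome" and "Safari",
    # and Chrome strings also contain "Safari".
    for marker, family in (("Edge/", "Edge"), ("Chrome/", "Chrome"),
                           ("Firefox/", "Firefox"), ("MSIE ", "Internet Explorer"),
                           ("Safari/", "Safari")):
        if marker in ua:
            return family
    return "Other"


def browser_shares(user_agents):
    """Return each browser family's share of the logged requests."""
    counts = Counter(browser_family(ua) for ua in user_agents)
    total = sum(counts.values())
    return {family: count / total for family, count in counts.items()}


shares = browser_shares(SAMPLE_LOG)
```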
Web crawling with user agents
Web crawling bots also use user agents. Each search engine’s web crawler identifies itself to web servers with its own user-agent string.
Browser bots
Web servers give bots special treatment. For instance, they are sometimes allowed to go through registration screens without actually registering. You may be able to bypass these screens sometimes if you set your user agent to that of a search engine’s bot.
Likewise, web servers may also give orders to bots through the robots.txt file. This file tells you the rules of the site and indicates what is not allowed, like if certain data or pages cannot be scraped. For instance, a web server may tell a bot to leave. It may also tell a bot that it can only index a particular section of the website. The web server identifies the bots through their user-agent strings in the robots.txt file.
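To see how per-bot rules in robots.txt work in practice, the sketch below feeds a small, made-up robots.txt to Python’s standard `urllib.robotparser` module. The bot name `ExampleBot`, the rules, and the URLs are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one bot is told to leave entirely,
# everyone else may crawl anything except /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent, url):
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)
```

For a live site, the same parser can load the file from the site’s robots.txt URL instead of from a string.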
Major browsers often have ways for you to set a custom user agent. With user agent switching, you can check how the web servers respond to each browser’s user agent. For instance, you may set the user agent of your desktop browser to that of a mobile browser.
Doing this will let you see how the web server shows a web page on mobile devices. But apart from using a custom user agent, you also need to rotate them to ensure you do not get blocked.
How to rotate user agents?
To rotate user agents, you must first collect a list of user-agent strings. You can get these strings from real browsers. After that, add these strings to a Python list.
Finally, make every request pick a random string from this list. The same approach works for user agent rotation in Selenium 4 and Python 3.
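The steps above can be sketched with the Python standard library alone. This is a simplified stand-in, not a Selenium script: it builds a request object with a randomly chosen user agent but does not send it. The helper names are hypothetical, and the strings come from the list earlier in this guide.

```python
import random
import urllib.request

# User-agent strings taken from the list of common user agents above.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:53.0) Gecko/20100101 Firefox/53.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393",
]


def pick_user_agent(rng=random):
    """Pick one user-agent string at random."""
    return rng.choice(USER_AGENTS)


def build_request(url, rng=random):
    """Build (but do not send) a request carrying a randomly chosen User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": pick_user_agent(rng)})
```

With Selenium, the chosen string would typically be handed to the browser at startup instead, for example through a Chrome `--user-agent=` argument. That part is left out here so the sketch runs without a browser installed.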
While this is just one way to rotate the user agents, you can also leverage other methods. However, you need to follow specific techniques in each method:
- Make sure you are rotating a full header set associated with each user agent.
- Send the headers in the same order as a real browser would.
- Set the Referer header (the HTTP spelling of “referrer”) to the page you visited previously.
When you are sending a Referer header, make sure the cookies and IP addresses do not change. Alternatively, if you want to avoid the headache, you can use a proxy that rotates user agents for you.
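One way to follow these three points is to store each user agent together with the rest of its headers and to add the previously visited page when building the next request. The sketch below assumes that approach. The `Accept` and `Accept-Language` values are illustrative placeholders rather than headers captured from real browsers, and the names are hypothetical.

```python
# Each profile keeps a user agent together with the headers sent alongside it.
# Python dicts preserve insertion order, so the order written here is kept.
HEADER_PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    },
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:53.0) "
                      "Gecko/20100101 Firefox/53.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
    },
]


def headers_for_next_request(profile, previous_url=None):
    """Copy a header profile and add the previous page as the Referer."""
    headers = dict(profile)  # copy so the stored profile is not modified
    if previous_url:
        headers["Referer"] = previous_url
    return headers
```

Whether the header order survives on the wire depends on the HTTP client you use, so it is worth checking the requests your tool actually sends.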
In this way, you will not have to rotate the user-agents for each request manually. Proxies can set up automatic user agent string rotation and IP rotation. Thus, it will seem as if the requests are coming from multiple web browsers.
It decreases your chances of being blocked and improves success rates. Rayobyte provides different types of proxies including ISP, data center, and residential proxies that can help you accomplish this without all the manual effort and hassle.
Why Change Your User Agent?
As mentioned above, you can change the user-agent string to trick the web server into believing that you are using a different browser or device. But why would you do this? Here are some ways in which user agent switching can help you.
Website development
When you are developing a website, it is vital to see if it is functioning correctly on different browsers. One way to do this is to download different browsers and access the website through them.
But what if you do not have a specific device that only runs a particular browser? It’s not effective to simply buy each individual device just to ensure things run smoothly.
The simpler way is to change your user agent. Doing so will let you see if your website works on all common browsers. Similarly, backward compatibility may be necessary for your website. In this case, you can change the user agent string to that of Internet Explorer 8 instead of installing a copy of the browser manually.
Bypass browser restrictions
Although it does not happen as commonly as it used to, some websites and web pages can only be viewed on certain browsers. For instance, you must have seen messages that say a specific web page can only be viewed correctly if opened on a specific browser.
Instead of changing the browsers, you can benefit from user agent switching.
Web scraping
When you are scraping the web for competitor pricing or any other data, you have to take some steps to ensure that you do not get banned or blocked by the target website. One of these measures is changing your user agent.
Every website identifies the browser and the operating system of the request through the user agent. Like IP addresses, if a website receives too many requests with the same user agent, it will likely block you. To avoid this, frequently change the user agent string during web scraping instead of simply sticking to one. Developers sometimes add fake user agents to the HTTP header to avoid getting blocked.
You can either use a user agent switcher or make a user agents list manually.
Some advanced users may also change the setting to the user agent of a popular search engine’s bot. Since every website wants to rank well on the most popular search engines, websites often let those bots in without a hiccup.
How to Change User Agent String?
You can alter the user agent to change browser identification. The web server will see the request as if it is coming from a different browser than the one you are actually using. You can do this if the site is incompatible with your browser or you simply want to scrape the web for information.
The process of changing user agents differs for each browser. This guide will discuss the method for Chrome.
Change browser identification on Chrome
The user agent setting is in Developer Tools. Click the menu button and then go to “More Tools.” You will see “Developer Tools” here. An easier way to open it is to press Ctrl+Shift+I on your keyboard.
In the Console tab, click the menu button. Then, choose “Network Conditions.” If you cannot see the console, click the menu button on the corner of the pane. It will be next to the “x” button. Click this button and select “Show Console.”
When you are in the Network Conditions tab, you will see the option for “User agent.” The “Select automatically” box is checked by default. Uncheck this box and choose a user agent from the existing list.
Alternatively, you can set a custom user agent. Keep in mind that this setting will only be applicable as long as the Developer Tools pane is open. Plus, it only applies to the tab you are currently using.
The primary reason for changing user agents is to ensure that a website does not block your request. But why do websites even block user requests? Mainly, it is to protect their information and avoid overwhelming the server.
How Do Websites Block Requests?
As a business, you may want to scrape websites for different purposes to gather powerful data to help you make informed decisions. One of them is price scraping.
For instance, if you are planning to establish your business, you will have to develop a price strategy. What better way to do it than checking the prices at which your competitors are selling their products? But it can be practically impossible to manually check the price for each product every single one of your competitors is selling. Instead, you can use a price scraping tool to find this information, along with other data such as product descriptions and product attributes.
Scraping tools send many requests in a short amount of time, which can overwhelm a site. This can cause slow loading times or cause sites to crash. Additionally, bad actors may try and scrape sites for unethical reasons or deliberately harm sites. Many websites have anti-scraping mechanisms to prevent their sites from being overwhelmed and to protect them from bad actors.
Here are some popular ways in which websites block companies from collecting data:
Rate limitations on IPs
Sending several requests from the same IP can come off as suspicious. The threshold for every website differs. For instance, one website may consider 20 requests from the same IP too many, while another may consider 200 requests as the max.
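A rough sketch of how a website might enforce such a threshold is a sliding-window counter per IP address. The class name, the default limits, and the explicit `now` timestamp argument are assumptions made so the example is easy to follow. Real sites use a variety of mechanisms.

```python
from collections import defaultdict, deque


class IpRateLimiter:
    """Allow at most max_requests per IP within a sliding time window."""

    def __init__(self, max_requests=20, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # IP address -> timestamps of recent requests

    def allow(self, ip, now):
        """Record a request at time `now` (in seconds) and say whether it is allowed."""
        hits = self._hits[ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

From the scraper’s side, this is why spreading requests over time and over several IP addresses keeps each address under the site’s limit.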
IP geolocation detection
Some websites block access based on the geographical location from where the request is coming. For instance, websites in a particular country may only accept requests coming from that country. This could be due to government restrictions. Some websites also limit access due to licensing restrictions based on TV and other media deals. But there is a way to go around this. You can make it appear to the website that you are sending a request from its country. You can access websites in these countries by using the relevant proxies.
User agent detection
Websites also detect the user agent to determine if the request comes from a bot or a human user. That is why you should change browser identification by using a custom user agent.
How to Avoid Getting Banned While Web Scraping?
When scraping the web, you have to be entirely responsible and careful. Many website owners are not believers in open data access. Similarly, if you are scraping your competitors’ websites for price comparison, you may be banned if you send too many requests as it can slow down the websites.
Here are a few tips to avoid bans during web scraping:
Bypass anti-scraping mechanisms — but be ethical
You of course want to be ethical and respectful when gathering data.
Firstly, you need to know what the robots.txt file contains and how it functions. The file tells crawlers which pages they can or cannot request from a website. By following it, crawlers avoid overloading the website and slowing it down with requests. The file also has scraping rules. For instance, most websites allow search engines’ bots to scrape them since they want to rank high in the search results. Meanwhile, some websites have Disallow: / in the robots.txt file. That rule means the site does not want any of its pages requested by the bots it applies to.
Websites that have an anti-scraping mechanism in place ensure the efficacy of the process by checking if the request is coming from a bot or a human being. They do so by monitoring the following points:
- Speed: Humans can only send requests to websites at a limited speed. Meanwhile, a bot can send many requests every minute or even every second. If you send requests at a rate way beyond a human’s capability, the anti-scraping mechanism will label you as a bot.
- Pattern: The anti-scraping mechanism also checks the pattern of the requests. Are you only targeting the links or the product descriptions on every page of a website? If the mechanism detects a pattern, it could categorize you as a bot.
- IP address: If you use the same IP address to send an excessive number of requests, you will most likely get blocked.
Use random intervals to space requests
If you send requests to a website at oddly specific times, its anti-bot mechanism will easily detect that the request is not coming from a human. Real people are not this specific or predictable.
A human would not send a request to a website exactly every 10 minutes in the middle of the day. If you want to avoid getting blocked, it is best to use randomized delays. In this way, you will also be able to comply with the website’s rules and not overwhelm the server. The best place to look for the right delay time is the robots.txt file. Some sites state a crawl delay there, which is the time a crawler should wait between requests. If one is given, follow it and wait the right amount of time before sending a subsequent request.
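A minimal sketch of randomized delays, assuming the site’s robots.txt may or may not include a `Crawl-delay` line. The function name, the default base and jitter values, and the sample robots.txt are hypothetical. The parsing itself uses Python’s standard `urllib.robotparser` module.

```python
import random
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt lines for the example.
ROBOTS_LINES = [
    "User-agent: ExampleScraper",
    "Crawl-delay: 10",
    "Disallow: /private/",
    "",
    "User-agent: *",
    "Disallow: /private/",
]


def next_delay(robots_lines, user_agent, base_seconds=2.0, jitter_seconds=3.0, rng=random):
    """Return how long to wait before the next request, in seconds.

    The wait is never shorter than the site's crawl delay (if it states one)
    and always has a random extra amount so requests are not evenly spaced.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)
    crawl_delay = parser.crawl_delay(user_agent)  # None when no delay is stated
    minimum = max(base_seconds, float(crawl_delay or 0))
    return minimum + rng.uniform(0, jitter_seconds)
```

In a scraper loop, the returned value would be passed to `time.sleep()` between requests.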
Additionally, scrape during off-peak hours, which are often overnight. This will ensure you aren’t overwhelming a site when human users are trying to shop or browse a site’s pages.
Use the right proxy
Keeping in line with the point discussed above, IP rotation can prevent bans. But the type of IP address you are using also determines your likelihood of getting banned or blocked.
For example, if you use a residential IP address, you have a lower chance of getting blocked since residential IP addresses are linked to human visitors. Meanwhile, data center proxies have a comparatively higher chance of being banned since they are not associated with human visitors.
Using a suitable proxy will help:
- Lower the chance of IP blocks.
- Increase your anonymity.
- Bypass geo-targeted blocking.
- Make you more secure during web scraping.
Rayobyte’s rotating residential proxies are perfect for large-scale scraping since they allow you to gather information from the web without getting banned or blocked. Since these IPs come from residential sectors, they appear more natural and humanistic to websites.
Rayobyte ethically sources these IPs from actual residential users who have agreed to give their IP addresses to the network for ample compensation. Therefore, these IPs have the lowest risk of bans.
More importantly, these are rotating proxies, which means they keep changing. If you want to achieve higher web scraping success rates without worrying about the number of requests you can send per minute, the Rayobyte residential proxies are a great pick for you.
Meanwhile, the data center proxies from Rayobyte are equally effective for web scraping. Most importantly, Rayobyte has nine autonomous system numbers (ASNs). Even if one ASN gets blocked by the website, you would not experience downtime. Instead, you can simply route your web scraping process to one of the other ASNs.
Use a user agent that looks organic
We have already talked about web servers identifying browsers and operating systems through the user agent — if a website receives too many requests with the same user agent, it may block you.
However, if you change browser identification for each request, you have a lower risk of being blocked. But while you are managing all other company operations, you may not have the time or expertise to switch user agents periodically. That is where Rayobyte’s Web Scraping API comes in. The talented team at Rayobyte’s Web Scraping API can build custom scraping solutions for your precise needs, regardless of what your budget is. You can leave your user agent rotating worries to Rayobyte’s Web Scraping API and focus on other business tasks.
New modules are constantly added to Rayobyte’s Web Scraping API, so you are sure to find exactly what you need for your scraping tasks. But if you have any specific requirements, you can benefit from the provider’s custom solutions.
Use CAPTCHA solving solutions
Many websites use a Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) to determine if the request is coming from a bot or a human user. This is often put in place because the websites do not want you to steal their information. In a CAPTCHA, you have to click on certain images as instructed.
Unfortunately, simple scraping scripts usually cannot solve these image challenges on their own. Therefore, when web scraping, you may have to resort to services that automatically resolve CAPTCHAs. These tools are ready to use and allow you to bypass these restrictions.
Try a headless browser
Headless browsers are web browsers that do not have a user interface. In simple terms, this means that they are just like the browser you regularly use, but they do not have URL bars, bookmarks, and tab bars.
Instead of typing into a URL bar, you guide these browsers programmatically. Basically, you write a script to tell them how they must act. Although they do not have visual interaction features, these browsers let you emulate downloading, scrolling, and clicking as a regular browser does. This feature makes them ideal for repetitive tasks, such as scraping and web crawling. Moreover, as visual features of the browser do not have to be loaded, the job takes less time to complete. You also save resources compared with running a full browser.
Keep in mind that these browsers can still be memory- and CPU-intensive, especially when you run many of them. Thus, they may cause crashes. When you use a regular HTML extraction tool to scrape the web, sites can check for signs of a real browser, such as JavaScript execution and installed extensions. They can then block you if they find out that you are not a real user.
A headless browser executes JavaScript, so it lets you emulate how real users interact with the JavaScript-driven elements of a page. Therefore, you can use these browsers to get information from websites with the most stringent restrictions.
Scrape smart and ethically
Keep these pointers in mind when scraping the web. Do not send too many requests in a certain timeframe, use different IP addresses, and make your robot look as organic as possible.
If you need access to multiple IPs when you only have a single browser or device, Rayobyte does the job for you. The provider’s residential and data center proxies are specialized for big and small companies, meeting their web scraping needs in the most efficient way possible.
How Do Proxies Help Enterprises Collect Data?
Proxies such as those provided by Rayobyte can help enterprises gather data for different purposes. As an entrepreneur or a running business, you would be intrigued to know how scraping the web using proxies can help your business right now and in the long run.
Competitive analysis
Gone are the days when one business had a monopoly in the market because they were the only ones making a particular product or offering a specific service. Today, there are many options for customers.
Therefore, it is essential to know what your competitors are up to and how you can have a leg up on them. Web scraping using proxies can help you gain this edge over your competition.
Suppose you are starting a new business. Initially, you may not have a clear idea of getting started and which areas to focus on. But if you scrape your competitors’ websites, you can gather plenty of data about factors that drive the consumers’ buying decisions.
For instance, you can check the pricing strategies your competition is using. Which price range do the products fall in? What do the prices drop to during sales? Additionally, you can check other things like product descriptions and visuals. Do your competitors also show product videos along with pictures? Which attributes are they mentioning in the product description?
These things can help you appeal to the customers — if a specific trend is working for the majority of your competitors, it will most likely work for you too.
Product optimization
When buying a product online, customers often read item reviews beforehand to know what other people’s experience was like. Interestingly, you can also use this information to enhance your own product according to the customers’ liking.
You can scrape mentions of your product on different sites to see what people are saying about it. You can also scrape your competitors’ websites and other sites for mentions of products that are similar to yours, and specifically target customer reviews to identify what the customers want — or what they don’t.
If in most reviews, customers mention that they would prefer your product to be available in different colors, you can focus on introducing that and meeting their wants and needs. In this way, you do not have to go the trial-and-error route because someone already did that for you. You can just use this readily available information to make your offerings better than the competition.
Final Words
In short, the most common user agents act as intermediaries between the internet and the user. They give the web server necessary information about the device, browser, software, and operating system you are using to send the request to a website. Web servers can then use this information to show you relevant web pages.
As a business, you will have to change and rotate your user agents if you intend to scrape the web for competitor analysis and data collection. If you do not want to deal with the hassle of manual scraping or blockages by websites, try Rayobyte’s residential and data center proxies. Rayobyte proxies power the scraping needs of companies and organizations like small businesses, government agencies, and the Fortune 500.