Why Is My Crawler Being Detected?
If you run a business that requires gathering a lot of data quickly, you’ve probably tried web scraping. But you might be getting frustrated by how often your crawler gets detected while trying to retrieve important information.
Even veteran data miners can have trouble keeping up with the firewalls designed to prevent data scraping. But there are measures you can take to make sure your web crawler doesn’t get detected. Many of these measures also protect your anonymity and boost your security.
Here we’ll discuss why your web crawler was detected and how you can avoid being blacklisted in the future.
Why Your Crawler Is Being Detected
A few factors can lead to your crawler getting spotted, and likely banned, by the site you’re mining data from. For example, you might not be using proxies (more on them later), or you might have overlooked key guidelines in the site’s robots.txt file.
Usually, though, the reason you’re getting blocked is one of the following:
- Site cookies
- Your browser’s user agent
- Your IP address
- Your bot’s behavior
Many websites save cookies in your browser when you visit them. If you come back without clearing those cookies, the server can recognize your browser from before and block it if it sees bot-like activity. If you’re using a browser-based web crawler without taking any measures to clear or block cookies, you’ll get spotted and banned pretty quickly.
Browsers also send a short identifying string, called the user agent, in a header with every request they make to a web server. It typically names the browser, its version, and the operating system or device it’s running on, and it can be used to help tie online activity to a specific browser if you don’t cover your tracks.
Your internet protocol (IP) address identifies your device or network online. It tells the server where to send the data you request, and it can reveal your approximate location. If a site sees an immense number of requests coming from one IP address, it will likely flag that IP as a bot and block it.
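Finally, your bot’s behavior can give it away: sending requests faster than any person could, or moving through every page in the same pattern, doesn’t look like a human visitor. Any one of these factors, or a combination of them, can leave you staring at your screen wondering why it says “crawler detected.” But there are various tools and best practices you can use to protect your crawler from detection.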
How to Avoid Being Detected While Web Scraping
The best way to avoid detection is to cover your tracks. Blasting a site with hundreds of requests from your home desktop IP address is a good way to get permanently blocked from that page. Luckily, there are multiple methods you can use to avoid that.
1. Use proxies to conceal your IP
It’s always a good idea to hide your IP when web scraping. Using proxies masks your true IP with another, normal-looking address. Proxies act as go-betweens, masking the requests you send and directing the data you receive from the server back to your device.
There are different types of proxies you can use. Data center proxies will mask your IP, but sites can often recognize their addresses as belonging to proxies unless they’re elite private proxies. You get the anonymity you need, but if a site is set up to block any kind of proxy, you could still run into trouble.
Residential proxies use the IP addresses of personal devices, so they can more closely mimic a real user. They don’t identify themselves as proxies, so the website server you’re on just thinks you’re another person.
Having multiple residential proxies and rotating them out offers a very high degree of anonymity and a low chance of being blocked.
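As a rough illustration of rotating proxies, here is a minimal Python sketch that uses only the standard library. The proxy addresses are placeholders from a documentation-only IP range, and the helper names (proxy_rotation, opener_for, fetch) are made up for this example; your provider’s endpoints and authentication will differ.

```python
import itertools
import urllib.request

# Hypothetical proxy endpoints; replace with addresses from your own provider.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def proxy_rotation(proxies):
    """Return an iterator that yields the proxies in round-robin order, forever."""
    return itertools.cycle(proxies)


def opener_for(proxy_url):
    """Build a urllib opener that routes HTTP and HTTPS traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(url, rotation, timeout=10):
    """Fetch url through the next proxy in the rotation and return the raw body."""
    proxy_url = next(rotation)
    with opener_for(proxy_url).open(url, timeout=timeout) as response:
        return response.read()
```

Creating the rotation once with proxy_rotation(PROXIES) and passing it to every fetch call means each request goes out through the next address in the list.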
2. Use a headless browser
Headless browsers might sound weird, but they’re a valuable item to have in your kit. They work like a normal web browser, loading pages and running scripts, but they don’t have a graphical user interface (GUI). That is, there’s no browser window.
Headless browsers are built to be driven by programs, such as scraping bots and automated tests, rather than by people. Crawlers don’t need to see the sites they interact with the way people do; they just need to be able to comb through the data. You control the browser from the command line or from a script to find the data you need.
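One way to try this, sketched below in Python, is to launch Chrome or Chromium in headless mode from a script and capture the rendered page. This assumes such a browser is installed; the binary names in the list are guesses that vary by operating system, and the function names are made up for this example.

```python
import shutil
import subprocess

# Assumed binary names; the actual name depends on your OS and installation.
CANDIDATE_BINARIES = ["chromium", "chromium-browser", "google-chrome"]


def build_headless_command(url, binary="chromium"):
    """Return the argument list that asks the browser to render url
    without opening a window and print the resulting DOM to stdout."""
    return [binary, "--headless", "--disable-gpu", "--dump-dom", url]


def find_browser():
    """Return the first candidate binary found on PATH, or None."""
    for name in CANDIDATE_BINARIES:
        if shutil.which(name):
            return name
    return None


def fetch_rendered_html(url, timeout=30):
    """Run the headless browser and return the rendered HTML as text."""
    binary = find_browser()
    if binary is None:
        raise RuntimeError("No Chrome/Chromium binary found on PATH")
    result = subprocess.run(
        build_headless_command(url, binary),
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return result.stdout
```

For anything beyond fetching a rendered page, such as clicking or scrolling, a browser automation library is usually a better fit than raw command-line flags.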
3. Check the robots.txt file
This is a cardinal rule of web scraping, and it’s relevant here, too. You may also hear this referred to as the robots exclusion protocol, and it’s basically a read-me file outlining what bots can and can’t do on a particular website.
If the site you’re scraping data from has a robots.txt file, read it. It lists which parts of the site bots may and may not crawl. Some files also include a Crawl-delay directive that specifies how long to wait between requests, so you can program your crawler accordingly.
Sometimes the robots.txt file will tell you if the site blocks crawling from any bots, period. If the site doesn’t allow crawling, don’t do it.
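Python’s standard library can do this check for you. The sketch below parses a made-up robots.txt (the sample rules and the "my-crawler" name are assumptions for illustration) and asks whether a URL may be fetched and what delay, if any, the file requests.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt used only for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)
print(rules.can_fetch("my-crawler", "https://example.com/products"))   # True
print(rules.can_fetch("my-crawler", "https://example.com/private/x"))  # False
print(rules.crawl_delay("my-crawler"))                                 # 10
```

Against a live site, you would typically point the parser at the real file with set_url() and read() instead of parsing a string.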
4. Choose the right user agents
A missing user agent, or one that looks artificial, is a red flag to a website server. Sending a request that way can get your crawler detected and banned. What you’ll need to do instead is work with user agents that mimic real users.
Many developers program a set of authentic-looking user agents into their web scraping bots so the bots pick from them automatically. You can also put together your own custom list of user agents to pull from when scraping.
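A custom list like that might look something like the following sketch. The user-agent strings are illustrative placeholders that will go stale over time, and build_request is a made-up helper name.

```python
import random
import urllib.request

# Example user-agent strings; treat these as placeholders and keep your own
# list current with the browsers real visitors actually use.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]


def build_request(url, rng=random):
    """Create a request whose User-Agent header is picked at random from the list."""
    return urllib.request.Request(url, headers={"User-Agent": rng.choice(USER_AGENTS)})
```

Passing the returned object to urllib.request.urlopen() sends the request with the chosen header.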
5. Leave the images alone
When scraping data from a website, just take the publicly available data that’s relevant to your needs. Images are more likely to be protected by copyright, meaning you could be infringing on the creator’s rights by scraping them. Images are also large and data-heavy, meaning they could slow down data mining.
Instead of writing and using a more complicated scraping technique that requires a site to load all data, just leave the images alone.
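If your crawler collects URLs before fetching them, one simple way to skip images is to filter by file extension first. This is only a sketch: the extension set is an assumption, and it won’t catch images served from URLs without an extension.

```python
from urllib.parse import urlparse

# Extensions treated as images here; extend the set to suit your target site.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".bmp", ".ico"}


def is_image_url(url):
    """Return True if the URL path ends with a known image extension."""
    path = urlparse(url).path.lower()
    return any(path.endswith(ext) for ext in IMAGE_EXTENSIONS)


def without_images(urls):
    """Keep only the URLs that do not point at image files."""
    return [url for url in urls if not is_image_url(url)]
```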
6. Do your web scraping during off-hours
If you send a high number of requests to a site’s server during peak traffic hours, it’s going to slow everything down. It places an unnecessary burden on the server and takes longer for you to get what you need. Not only that, the extra load on the server slows the page down for everyone else who visits. Avoid scraping during peak hours.
Instead, find the times your target web page doesn’t see a lot of traffic and crawl it then. This will vary from page to page, but late night is a good place to start. See where the server is located, and try data scraping the site around midnight local time.
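To act on this in code, the crawler needs to know what time it is where the server is. The sketch below assumes you already know the server’s UTC offset and treats midnight to 5 a.m. local time as off-peak; both the window and the function name are assumptions, and a fixed offset ignores daylight saving time.

```python
from datetime import datetime, timedelta, timezone


def is_off_peak(now_utc, server_utc_offset_hours, start_hour=0, end_hour=5):
    """Return True if the server's local hour falls in [start_hour, end_hour)."""
    server_tz = timezone(timedelta(hours=server_utc_offset_hours))
    local_hour = now_utc.astimezone(server_tz).hour
    return start_hour <= local_hour < end_hour


# Example: noon UTC is 4 a.m. at a server eight hours behind UTC.
noon_utc = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
print(is_off_peak(noon_utc, -8))  # True
print(is_off_peak(noon_utc, 0))   # False
```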
7. Slow down
Sending a huge number of requests in a short time will get you flagged as a bot, especially if you’re not using proxies. Even if you are, the IP address you’re using could still get blocked or rate limited if you hammer the server with too much traffic at once.
Stagger your requests to send at random intervals of a few seconds or even a couple of minutes. That way, it appears the requests are coming from regular users, especially if you’re using rotating residential proxies.
Again, the robots.txt file is a good place to look for a lead on your request rate. It may include a Crawl-delay value that says how long to wait between requests. Abide by that, and it’s much less likely your crawler will get throttled or blocked.
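Staggering requests can be as simple as sleeping for a random interval between them. In the sketch below, the 2 to 10 second range is an arbitrary assumption, and fetch stands in for whatever function your crawler uses to download a page.

```python
import random
import time


def next_delay(min_seconds=2.0, max_seconds=10.0, rng=random):
    """Pick a random pause length between min_seconds and max_seconds."""
    return rng.uniform(min_seconds, max_seconds)


def crawl_politely(urls, fetch, min_seconds=2.0, max_seconds=10.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:  # no need to wait before the very first request
            sleep(next_delay(min_seconds, max_seconds))
        results.append(fetch(url))
    return results
```

If the site publishes a crawl delay, using it as min_seconds keeps the random pauses at or above what the site asks for.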
8. Pattern your crawler after real users
If your crawler bot goes through every web page it finds in the same pattern, you could be flagged as a bot even if you’re taking other precautions. To avoid that, switch up your bot’s crawling pattern now and then, and program it to behave like a human user.
Adding mouse movements, clicks, and scrolls into the pattern can help your bot look more like a person. Just don’t make the behavior too random — super erratic behavior is just as much a red flag as the same pattern every time.
When programming your bot’s crawling pattern, think about how a person would interact with a page. Depending on what they’re looking for, they might land on the homepage first and then make requests to a few of the other pages within that site. Since you’ll only be looking for specific information, you can program your bot to act this way and still get the data you need.
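A very small step in that direction is to start each crawl at the homepage and then visit the pages you need in a different order each time. The function name and URLs below are made up for illustration, and real human-like behavior would involve much more than request order.

```python
import random


def plan_visit_order(homepage, target_pages, rng=random):
    """Start at the homepage, then visit the target pages in a shuffled order."""
    remaining = list(target_pages)  # copy so the caller's list is left untouched
    rng.shuffle(remaining)
    return [homepage] + remaining


order = plan_visit_order(
    "https://example.com/",
    ["https://example.com/pricing", "https://example.com/products", "https://example.com/about"],
)
```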
9. Be mindful of fingerprinting
Websites can use your browser fingerprint to flag you as a bot, even if you’re using proxies. It’s commonly used together with IP logging and cookies to track visitors to a site. Browser fingerprinting keeps track of the information logged from a browser when it visits a specific site, like:
- What browser version you’re using
- What fonts you use
- What plugins you have
These are only a few examples. While each detail may seem general, the combination can be unique enough to tie activity back to an individual user, given enough data and time.
Modern web scraping bots are being built to mimic regular users’ browser fingerprints to get around measures like these. The bot presents the fingerprint of an ordinary-looking browser to the server to slip under the radar.
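To see why a handful of ordinary details can add up to an identifier, consider this toy sketch. It is not how any particular site computes fingerprints; real systems use many more signals. It simply hashes a few attributes together to show that changing one font produces a different result.

```python
import hashlib
import json


def fingerprint(attributes):
    """Hash a dict of browser attributes into a short, stable identifier."""
    canonical = json.dumps(attributes, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Two hypothetical visitors that differ only in one installed font.
visitor_a = {"browser": "Firefox 125", "fonts": ["Arial", "Helvetica"], "plugins": []}
visitor_b = {"browser": "Firefox 125", "fonts": ["Arial", "Verdana"], "plugins": []}

print(fingerprint(visitor_a) == fingerprint(visitor_b))  # False
```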
10. Beware of honeypots
Your bot might be getting detected by following a link that only crawlers can see. These links are called honeypot traps because they attract a bot by appearing as just another link.
More and more people are writing honeypot links into their code because they make it very easy to detect when a bot visits their page. The links are coded so as not to be visible to the human eye, so only a bot reading the page’s code would follow them. Once your bot follows one, it’s quickly blocked.
Honeypots are hard to find, but they’re usually hidden with CSS. Look for style rules like visibility:hidden or display:none on or around links, as they may be a sign of honeypot traps. Another common way to hide these links is by making them the same color as the page’s background, often white. A color hex value of #fff or #ffffff indicates white, so a link in that color on a white background is probably a honeypot trap.
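A crawler can apply part of this check automatically. The sketch below uses Python’s built-in HTML parser to collect links while skipping any whose inline style hides them. It only looks at the style attribute on the link itself, so links hidden through an external style sheet, a parent element, or a background-matching color would need further checks; the class and function names are made up for this example.

```python
from html.parser import HTMLParser

# Inline-style fragments that suggest a link is hidden from human visitors.
HIDDEN_MARKERS = ("display:none", "visibility:hidden")


def looks_hidden(style):
    """Return True if an inline style string hides the element."""
    compact = style.replace(" ", "").lower()
    return any(marker in compact for marker in HIDDEN_MARKERS)


class VisibleLinkCollector(HTMLParser):
    """Collect href values from <a> tags that are not hidden by inline styles."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        style = attributes.get("style") or ""
        if href and not looks_hidden(style):
            self.links.append(href)


def visible_links(html):
    """Return the hrefs of all links in html that pass the hidden-style check."""
    collector = VisibleLinkCollector()
    collector.feed(html)
    return collector.links
```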
Keep the tips above in mind the next time you’re data mining, and you’ll be much less likely to get banned. You’ll also avoid making trouble for the site administrator, so it’s a win-win.
- Use proxies to mask your true IP address, and rotate them often.
- Program your bot to mimic human behavior — random actions are good, but nothing too erratic.
- Make sure your browser is using a valid user agent.
- Avoid scraping images.
- Vary your request intervals.
Do that, and you’ll have an easier time getting the data you need for your next project. And remember to be respectful: check the robots.txt file, respect copyrights, and don’t overload site servers by sending requests during peak hours.