The Complete Guide to Ethical Web Scraping
Ethics is never an easy topic. If it were, they’d teach it in kindergarten. But if you made your way to a guide all about ethical web scraping – then clearly you at least recognize its importance. You’re a good person 😉
In fact, web scraping enables collecting data on a scale larger than ever before, and you know what they say about great power.
Fast, Effective, Efficient
Powerful API backed by ethical proxies.
Since digital adoption leaped forward in the last decade or so, web scrapers face a wider ethical responsibility. So, if you haven’t guessed already, this guide not only covers the ethics of web scraping, but also best practices to ensure you maintain the moral high ground in your web scraping endeavours.
Is Web Scraping Ethical?
Data collection and web scraping are as ethical as the person doing them. Since you determine the data scraping efforts – what data is collected, and the tools and processes used to do so – ethical web scraping is a decision.
The TL;DR of any ethical web scraping process is don’t be a d**k. But we could also split this into two further sections:
- Ethical data collection. In other words, the actual data you are collecting. If you’re after private information, then there’s simply no way to make that ethical.
- Ethical web scraping methods and tools. Alongside the “what” of data, you also need the “how” of web scraping.
Now, if you’re a fan of cliches or tired metaphors, you can think of the internet like a library. Information is everywhere. Sure, you’re allowed to go in and read the books – you’re also allowed to take notes. But copying a book word for word to resell it later? That would be unethical.
So when does the former become the latter? That’s where ethics matter the most. It’s just as true in web scraping as anywhere else.
In the rest of this text, we’ll explore the dos and don’ts of data scraping – the best practices to ensure ethical web scraping done right.
Web Scraping Ethics vs Legality
Similarly, you might be wondering if web scraping is legal. The short answer, again, is yes, so long as you’re not being stupid.
For the most part, web scraping ethics and the law tend to overlap. Accessing private data is both unethical and illegal. Slowing a website down with too many requests? Also unethical, and legally could be seen as a direct attack on a given site – especially in the eyes of the website owner.
Ensure Ethical Data Collection
Let’s start with the first part of web scraping: what and how you scrape. When you scrape data, the information you are collecting can be either ethical or unethical. And even publicly accessible information can be gathered through unethical means.
Implement the following data collection practices, and you’ll be keeping your web scraping on the right side of everything.
Only Scrape Public Data
Generally, if it’s in the public domain, it’s fair game. In other words, if you could access it from a computer and write it down manually, it’s probably ethical.
But this is not the same as saying that all public data is up for grabs. Intellectual property may be available online and still be copyrighted. Replicating this data or even using it is unethical.
You also have to use your own common sense here. You might be web scraping a website that already has private or protected data publicly displayed. If you simply scrape the site en masse without considering the actual information, that data can end up in the raw data on your side.
Take Only the Data You Need
This one is simply good advice. When web scraping, extract data relevant to your needs and nothing more. This way, you’re significantly less likely to scrape data you weren’t supposed to.
It’s also another way of saying that you should plan your web scraping projects ahead. Understand what you want to accomplish, and what information you need to do that. At this point, you can verify if your data scraping plan is ethical or not.
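One way to bake "take only what you need" into a scraper is a field whitelist applied before anything is stored. The snippet below is only a minimal sketch: the field names, the `NEEDED_FIELDS` set and the `keep_needed` helper are all made up for illustration, so swap in whatever your own project plan calls for.

```python
# Hypothetical whitelist: the only fields this imaginary project needs.
NEEDED_FIELDS = {"title", "price", "currency"}


def keep_needed(record: dict, needed: set = NEEDED_FIELDS) -> dict:
    """Return a copy of `record` containing only the whitelisted fields."""
    return {key: value for key, value in record.items() if key in needed}


# Example record as it might come out of a parser (made-up data).
raw = {
    "title": "Desk lamp",
    "price": "19.99",
    "currency": "EUR",
    "seller_email": "someone@example.com",  # not needed, so it is dropped
}

print(keep_needed(raw))
```

Filtering at this point means anything outside the plan never reaches your storage in the first place.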
Avoid Personally Identifiable Information
We’ve talked about this already, but it really is important to mention. Web scraping personally identifiable information is unethical – and in many cases, illegal. Regulations like Europe’s GDPR and California’s CCPA, among others, exist to protect people from companies collecting and using their details.
You can consider personally identifiable information (PII) as any data that can be connected to – and used to uncover – the identity of a specific person. There is room for debate when it comes to publicly available information. After all, if someone’s professional email address is on their website, or if their number is in the phonebook (for those of us old enough to remember those), it could be considered non-sensitive.
The real question, however, is what you’re web scraping this data for. If you’re collecting details about individuals without their consent or knowledge, you have some ethical considerations to figure out.
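If your scraper pulls in free text, one possible safeguard is to scrub obvious personal details before the text is saved. The sketch below assumes that emails and phone-like numbers are the main risk. The two patterns and the `redact_pii` helper are illustrative only, deliberately loose, and will not catch every kind of PII (names and addresses, for example, pass straight through).

```python
import re

# Loose, illustrative patterns; real PII detection needs far more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[email removed]", text)
    return PHONE_RE.sub("[phone removed]", text)


sample = "Contact Jane at jane.doe@example.com or +1 555-010-9999."
print(redact_pii(sample))
```

Treat this as a last line of defence. Not collecting the details at all is still the better option.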
Follow the Terms of Service
When you access a website, the website owner often has terms that you may need to follow. The most obvious of these is when you have to log in to a website. In order to log in, you are effectively agreeing to that website’s terms – which may or may not include anti-scraping clauses.
But not all terms are as clear cut. The EWDCI – the much quicker way of saying Ethical Web Data Collection Initiative – splits Terms of Service into two different categories.
- Clickwrap terms. These are the ToS that you agree to via deliberate clicking. This means you’ve directly agreed to some terms with the website owner, which may or may not include a formal data collection policy. Since you agreed to follow it, read up and make sure you follow it.
- Browsewrap terms. The opposite of clickwrap conditions, browsewrap terms of service refer to those policies that are on the website, but you haven’t directly agreed to.
So how do these terms influence your web scraping? Let’s quote the EWDCI directly, since ethical web scraping processes are their jam:
“The EWDCI holds the position that Browsewrap ToS do not always form binding contracts as users are not necessarily on notice of these terms, nor do they take any active steps to accept them.”
Now, a browsewrap ToS does not mean you can scrape data any way you want. Ethical web scraping still needs to be careful of the data it collects and the tools used. As you’ll see in the next section on web scraping tools, if you end up interrupting the website’s service, you can cause direct harm to the business, not to mention irritate the website owner directly. And that’s pretty dang unethical!
Ethical Proxies & Tools
Ethical web scraping is not just about data harvesting or screen scraping. The very tools you use also need to be ethically sourced and implemented. Ethical web scraping isn’t just about the end results – and by now, we all know the cost of poor data quality – but also about the tools and means by which that data is acquired.
On a simple level, we can consider both the proxies used, and the website being scraped. Both also have ethical implications.
Are You First Targeting APIs?
A typical web scraper works by loading a page and extracting data from its HTML, CSS and JavaScript elements. But before you do this, you should first check that the website in question doesn’t offer an API for the same need.
Using an API is still a valid form of data mining, but it’s far more polite. An API represents permission to extract data. If the website owner has set up an API for specific information, then it stands to reason that they’re okay with you using this raw data.
What’s more, an API is designed for such data requests, as opposed to typical data mining web scrapers, which need to make numerous requests and can have a bigger impact on the server.
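In code, "API first" can be as simple as trying the API route before falling back to the HTML page. Here is a rough sketch under some loud assumptions: the `/api/products` and `/products` paths are hypothetical, `get_products` is a made-up helper, and `fetch` stands in for whatever HTTP client you use (it should return the response body as a string, or `None` when the resource doesn’t exist). In practice you would find the real API in the site’s documentation rather than guessing at paths.

```python
import json
from typing import Callable, Optional


def get_products(fetch: Callable[[str], Optional[str]], base_url: str) -> dict:
    """Prefer the site's API; fall back to the HTML page only if there is none."""
    api_body = fetch(f"{base_url}/api/products")  # hypothetical API path
    if api_body is not None:
        return {"source": "api", "data": json.loads(api_body)}
    html = fetch(f"{base_url}/products")  # hypothetical HTML page
    return {"source": "html", "data": html}


def fake_fetch(url: str) -> Optional[str]:
    """Stand-in for a real HTTP client so the example runs offline."""
    pages = {"https://shop.example/api/products": '[{"title": "Desk lamp"}]'}
    return pages.get(url)


print(get_products(fake_fetch, "https://shop.example")["source"])
```

Passing `fetch` in as a parameter keeps the decision logic separate from the network code, which also makes it easy to try out without touching a real website.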
Are Your Proxies Ethically Sourced?
Nine times out of ten, you’re going to support your web scraping with some form of proxy. Additional IP addresses help mimic your target location, stop you from overloading websites from one IP, and overall stop you from getting banned.
But where do those IP addresses come from? When dealing with something like a data center, for example, you know that the IP is one of many in a central location. But you’re not the only one using that IP – if someone else before you was doing unscrupulous things, you might face the consequences of their actions.
So, you should use proxies from a provider that goes to great effort to ensure their proxies are only used ethically. Say… a company that emphasizes ethical web scraping enough to write a long blog post on it 😉
As for residential proxies? The need for ethical usage is even greater. Residential proxies come from genuine, residential IP addresses (hence the name, no?). How those proxies are sourced is just as important. Does the IP’s original owner know what it’s being used for?
We don’t give you complete freedom to do what you want with residential proxies. We adhere to ethical scraping projects. And the owners are fully aware of how we operate. It helps us build trust and, perhaps even more importantly, it ensures you get proxies that don’t already have a bad reputation.
Do You Have a User Agent String?
Did anyone tell you it’s always polite to introduce yourself? When web scraping, your identity is often masked already behind proxy servers, subnets and other means. But you should still be polite 😉
For websites, the closest thing to a calling card is a user agent string. It lets website owners know who you are and what you’re doing. People who implement brute-force hacks or DDoS attacks don’t tend to be so honest.
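Setting a descriptive user agent takes only a line or two. The sketch below uses Python’s standard `urllib.request.Request`. The bot name, info page and contact address are placeholders invented for this example, so replace them with details that really point back to you.

```python
from urllib.request import Request

# Placeholder identity: swap in your own bot name, info page and contact.
USER_AGENT = (
    "ExampleResearchBot/1.0 "
    "(+https://example.com/bot-info; bot-admin@example.com)"
)


def build_request(url: str) -> Request:
    """Create a request that says who the scraper is and how to reach its owner."""
    return Request(url, headers={"User-Agent": USER_AGENT})


# Building the request does not send anything over the network.
req = build_request("https://example.com/products")
print(req.get_header("User-agent"))
```

A contact address in the string gives the website owner a way to reach you before they reach for the ban button.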
Are You Harming The Websites You Scrape?
Unless you’re new to web scraping, this one should be obvious. If you make too many requests, you can end up causing issues with the website’s server. In turn, this can impact the website and even the business overall.
Perhaps even worse, if you send too many requests too frequently, you could be mistaken for a cyberattack. This would then trigger that company’s security measures, further adding to the mess. There is no standard measure for all websites here, but responses can range from limiting access to implementing additional CAPTCHAs and other means of catching rampant bots or spam machines.
And, of course, you could just end up getting your IP banned. Then nobody really wins.
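A simple way to avoid hammering a server is to enforce a minimum gap between requests. The class below is a minimal sketch of that idea, not a full rate limiter. `PoliteThrottle` is a made-up name, and the right interval depends entirely on the website, so any number you plug in is your own assumption.

```python
import time


class PoliteThrottle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> float:
        """Pause if the last request was too recent; return the delay used."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay:
                self._sleep(delay)
        self._last = self._clock()
        return delay


# Typical use (interval chosen arbitrarily for illustration):
#   throttle = PoliteThrottle(min_interval=2.0)
#   for url in urls:
#       throttle.wait()
#       ...fetch url...
```

Calling `wait()` right before each request keeps the pacing logic in one place, no matter how many different pages the scraper visits.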
Welcome to Ethical Data Collection and Web Scraping
If you made it this far, either you’ve reconfirmed your web scraping processes are fully ethical, or you’ve learned a few things to help ensure ethical data collection from here on out.
The simple answer is that data scraping efforts must be ethical in processes, means, information and overall purpose. A scraping process is completely ethical only if you are.
But if you want some pointers:
- Only take what you need. Data quality is expected, always, but this starts before you extract data at all. Build your web scraper to only take what is required and ethically allowed. It’s the only form of data mining that’s going to keep you in our good books.
- Respect each website. Don’t overload it, respect the terms of service, and be as friendly as possible.
- Use ethical proxies. Good people use good proxies. Bad people don’t even check.
So, what’s next? We can help you with our ethical proxies – ranging from data centers to residential proxies. We’re here to support you in achieving your goals the right way.
The information contained within this article, including information posted by official staff, guest-submitted material, message board postings, or other third-party material is presented solely for the purposes of education and furtherance of the knowledge of the reader. All trademarks used in this publication are hereby acknowledged as the property of their respective owners.