The Ultimate Guide to Web Scraping in R and the Importance of Proxies
Veteran web scrapers know web scraping is an easy, efficient way to collect vast amounts of information from the internet in minutes. If you’re interested in performing customized web scrapes, you can even write your own scraping program.
If you’re looking to write your own web scraper, you need to choose the right coding language to develop your program. One of the most popular web scraping languages is R, a free, open-source language that’s known for its versatility. If you’re at all familiar with programming in R, you can use it to write website scraping programs and collect data in no time.
This is the ultimate guide on how to web scrape in R. It's full of helpful information, including why R is an excellent scraping language, how to write your own web crawler in R, and best practices for getting better data.
Why Start Web Scraping in R?
The coding world has dozens of programming languages that can be used to collect data, so it can be hard to choose the right one. What makes R the right choice? R is a language specifically designed for collecting and analyzing data. Since that’s the whole point of web scraping, R is a great solution. Here’s what sets R apart from other programming languages you can use to scrape sites.
What is R?
The official R website, R-Project.org, states that “R is an integrated suite of software facilities for data manipulation, calculation and graphical display.” Basically, R is a language that was designed from the ground up to be a self-contained, flexible system that can easily handle, analyze, and display data.
But what makes R so effective as a web scraper? It has many strengths that make it particularly well-suited for data collection and analysis. For example, R was designed to support data analysis, so it’s fundamentally well-structured for making sense of the information you collect.
R is also a useful language because it’s:
- Open-source: R is free to use and open-source, so anyone can use it to program anything they want. The open-source license also means that people can modify the code however they need to. While you probably won't need to modify the language for your web scraper, it's still a helpful option to have.
- Platform-independent: The team supporting R has made sure that the language can run on just about any platform, including Windows, the Mac ecosystem, and Unix systems.
- Supported by a community: R is a living language that’s constantly being improved by a large, engaged community of users. This community is constantly creating new libraries and packages you can use to make your R-based programs more effective. You can also reach out to them for help if you run into a thorny problem.
- Connected to other languages: The open-source nature of R means that people have played around with it and produced packages to connect it with other languages. Today, languages from Python to Java to C++ can all be used with R if the correct packages are integrated into the program.
While R does have many benefits, it’s nowhere near the only language people use to write a web scraper. Here’s how web scraping in R stacks up with C# and Python, two other common coding languages used to run data scraping programs.
R vs. C#
R benefits:
- User-friendly
- Free
- Open-source
- Easy to pair with just about every language
R drawbacks:
- Low security
- Slower than C#
C# benefits:
- Very fast
- Extensive library support
C# drawbacks:
- Expensive
- Resource-intensive
- Difficult to learn
Microsoft developed C# as a programming language to support modular computer programs. It’s very well-known, with hundreds of thousands of programmers using it worldwide. However, C# has a significant downside: it’s expensive. Compared to the open-source nature of R, it’s costly to run a C# program.
In terms of complexity, C# and R are about equally difficult. Both take time and effort to learn, and neither is for complete beginners. Still, while C# requires a dedicated platform to run, R does not. That makes R more accessible in the long run.
Verdict: People who already know C# and have a way to run it may prefer to write with C#. Otherwise, R is easier and more accessible.
R vs. Python
R benefits:
- Dedicated to data analysis
- Prebuilt packages for web scraping
- Excellent support for large datasets
- Quick, lightweight programs
R drawbacks:
- Slower than Python
- Less flexible than Python
Python benefits:
- Flexible
- Faster than R
- Wide variety of libraries
- Easy to learn
Python drawbacks:
- Doesn’t support large datasets well
- Programs quickly become long
Python is a language that many people use to write simple web scraping programs. It’s a fast, flexible language that’s excellent at handling online interactions. It’s also simpler to learn than many other languages since it’s designed to be easy to read.
The biggest drawback to Python is that it’s not explicitly designed for large-scale datasets. Furthermore, its simplicity means that programs can become quite long. It’s more beginner-friendly for general coding, but it also takes more work to set up a functional web scraper than with R.
Verdict: If you don’t have any programming experience at all, it’s probably easier to learn Python than R. However, if you’re willing to learn, R has more dedicated data analysis tools, which makes web scraping with R easier in the long run.
How to Scrape Web Data in R
No matter what language you use, the process of web scraping is essentially the same. Scraping is the process of having a bot interact with a website’s HTML code. The bot reads the HTML looking for certain specified types of information. When it finds that information, it copies and saves it to be printed into your designated file format.
Before you can do any of that, though, you need to do some prep work.
- Choose the information you want to collect: Before you can start scraping, you need to set specific information targets. You can collect literally any kind of information that’s publicly available. To build a working scraper, you need to be specific about what you’re targeting. For your first scraper, it’s easiest to target something simple like product names and prices.
- Learn how the sites you want to scrape are structured: Every site is built a little differently. Different programmers will build site elements using a variety of variable names. Page URLs may be unique, or they may just be sequential digits. Look for patterns in how the site is built that you can use to your advantage when you’re scraping. A simple way to do this is by visiting the page in your scraping browser and examining it through the browser’s built-in developer tools.
- Consider connecting with an API: Some sites prefer that you don’t scrape them because scraping can strain a server. These sites often offer an Application Programming Interface (API) that you can connect to and download all the data you need. If a site offers an API, it’s easier and more ethical to use that than to perform a scrape.
Now that you’ve done your homework, you can actually start writing an R web scraping program.
Web Scraping with R Tutorial
Making a web crawler can seem intimidating, but it doesn’t have to be. If you have any background in the languages R, S, or the C family, you can put together a web scraping program in less than an hour. But if you’ve never written a scraping program yourself, it helps to first understand how the bot will interact with a website.
Understanding HTML and CSS
HTML is the fundamental structure of the internet. Just about every part of every website is eventually rendered through HTML. Browsers use HTML to know how to render web pages and what to display.
HTML code relies on elements and tags. Tags, like <h2>…</h2>, set everything from titles and headers to body text and images. In other words, tags determine what kind of element a given piece of text is. Web scrapers use HTML tags to identify the information that a person wants to collect. Before HTML is rendered into an entire site, it's just lines of code that the browser translates into a graphical user interface.
CSS is used alongside HTML to make sites more interesting. It allows sites to set dynamic appearance elements like positioning and color of different elements. Some websites use CSS to set apart certain HTML elements. CSS is structured with brackets, like this:
h2 {
color: red;
text-align: right;
}
You can also use CSS to identify specific kinds of information with your web scraper. The scraper reads the HTML and CSS and searches for the specific tags and CSS specifiers the programmer chose. Then the program collects that information, saves it to memory, and prepares to print it to the format the programmer designates.
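To make that concrete, here's a minimal preview in R using the rvest package (installed in step 2 of the tutorial below). The URL is just a placeholder, and the "h2" selector stands in for whatever tag or CSS selector holds the data you care about.

library(rvest)

# Read a page and pull the text out of every <h2> element.
# Any CSS selector works here: "h2", ".price", "#product-title", and so on.
page <- read_html("https://example.com")
headings <- page %>%
  html_elements("h2") %>%
  html_text2()
headings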
Now that you understand what you’ll be working with, you should be able to easily follow this R web scraping tutorial. Here’s how to start writing an R web scraping program that works for you.
0. Install R
If you don't already work with R, the first step to writing your web scraper is to install R on your computer. Go to the official R Project website, choose the most up-to-date download for your operating system, and follow the instructions to set it up.
Installing a programming language also installs the console and the instructions for how to run that language. You can write an R-based program without installing it, but you won’t be able to test it or check for errors. Since R is free and open-source, there’s no reason not to install the version designed for your computer.
1. Choose a coding environment
A coding environment is the place where you write the code that runs your program. R comes with its own coding environment built into the installation. Many people writing in R choose to use this built-in console, but some prefer other Integrated Development Environments (IDEs). These IDEs offer extra features, such as visualization and easy debugging tools. The two most common alternatives are RStudio and Visual Studio Code. You can even write R in a notepad file and paste it into the console or command line if you're confident in your programming skills.
2. Add important scraping libraries
Using R to program your scraper bot means you have access to a bunch of helpful code libraries. A code library is a collection of pre-written programs and code that handle simple tasks for you. Using a library helps you avoid reinventing the wheel every time you want to perform a routine function. Since R is popular and open-source, there are thousands of libraries you can use to make your web scraping bot.
In R, there’s one main library that will help you make your scraping program the best it can be. The rvest library is an all-in-one web scraping resource that you can use to build your own scraper bot.
You can use other libraries, too. The rvest library works well with the other "tidyverse" libraries, which are excellent for keeping data neat and "tidy." However, if you're just building a simple scraper, you can stick with rvest for now.
If you want to scrape more than one page at a time — which you probably do — you should also install the polite library. This library will help you avoid overloading a site and potentially getting your web scraper blocked.
To install these two libraries, go into your R console and type in:
install.packages("rvest")
install.packages("polite")
This will download the libraries to your computer for future use.
Next, go to your R console or IDE and create a new program file. Type in:
library(polite)
library(rvest)
This is the start of your scraping program. Once you run your scraper, these lines will cause your computer to initiate these libraries and use the related commands for the program.
3. Access a webpage
One of the nice things about R is that you don't need to launch a browser to start scraping a website's HTML. As long as your machine is connected to the internet, your program will fetch the appropriate HTML straight from the URL, with rvest and polite handling the HTTP request behind the scenes.
To access a webpage, you need to set the URL of your target pages as a variable. The names of the variables don’t matter as long as you know what they mean. For example, if you want to collect the names of different kinds of coffee beans from this post on Home Stratosphere, you would write:
url <- bow("https://www.homestratosphere.com/types-of-coffee-beans/", force = TRUE)
This line does a lot at once. First, bow("https://www.homestratosphere.com/types-of-coffee-beans/") instructs the program to use the polite library to visit the page. The polite library checks the website's "robots.txt" file for instructions about how the site prefers to be scraped and, behind the scenes, makes sure you're not visiting the site too quickly. The force = TRUE argument tells polite to fetch the robots.txt file fresh rather than reuse a cached copy.
Meanwhile, the "url <-" portion stores the result of that whole bow() call in a variable named url. That lets you refer to the session just by writing "url" in later lines of code.
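If you want finer control, bow() also accepts optional arguments for a custom user agent string and a minimum delay between requests. Here's a quick sketch; the contact address and the 10-second delay are placeholder values for illustration.

library(polite)

url <- bow(
  "https://www.homestratosphere.com/types-of-coffee-beans/",
  user_agent = "my-r-scraper (contact@example.com)",  # identify your bot to the site owner
  delay      = 10,                                    # wait at least 10 seconds between requests
  force      = TRUE
)
url  # printing the bow object shows the crawl delay and whether the path is scrapable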
Next, you want to have your program actually read the code for the site:
4. Collect data from the site
At this point in your program, your code should look like this:
library(polite)
library(rvest)
url <- bow("https://www.homestratosphere.com/types-of-coffee-beans/", force = TRUE)
Now you need to instruct your program on what to do with this information. The polite library conveniently has a command simply called “scrape” that automates a lot of the rvest process for you. The basic form of the scrape command looks like this:
info <- scrape(url)
This sets the results of your scrape as yet another variable, “info.” However, you also need to tell the scraper what information to collect. This makes the line a little more complicated.
First, you need to figure out how the information you want to collect is stored. The easiest way to do this is to look at the HTML elements or CSS selectors that hold the information. On the coffee bean page, each bean name sits inside an "h2" element. To pull these out, use "html_nodes()" and "html_text2()". The html_nodes() function finds every element matching the selector you pass it, and html_text2() collects the text inside those elements, tidying up stray whitespace as it goes. That will look like this:
html_nodes("h2") %>%
  html_text2()
You'll notice we're using the pipe operator, %>%, which rvest re-exports from the magrittr package. The pipe passes the result of one step straight into the next, letting you chain a sequence of actions on the same object. An elementary version will look like the following:
info <- scrape(url) %>%
  html_nodes("h2") %>%
  html_text2()
However, scrape() can also take a query = list() argument, which appends URL query parameters (the ?key=value pairs you sometimes see at the end of a web address) to the request. For example, the line below would add ?t=Coffee Beans to the URL your scraper requests:
query = list(t = "Coffee Beans")
The entire collection of code you just wrote fits together like puzzle pieces:
info <- scrape(url, query = list(t = "Coffee Beans")) %>%
  html_nodes("h2") %>%
  html_text2()
5. Print results
Now you’ve finally gotten to the point where you can ask the program to list everything it’s found. This is as simple as just listing the “info” variable on its own:
info
You can also use the “head()” command to print just the beginning of the information you’ve collected if there’s a lot:
head(info)
The complete program
Congratulations! You’ve written a simple, complete web scraping program. Here’s the entire program all in one place:
library(polite)
library(rvest)
url <- bow("https://www.homestratosphere.com/types-of-coffee-beans/", force = TRUE)
info <- scrape(url, query = list(t = "Coffee Beans")) %>%
  html_nodes("h2") %>%
  html_text2()
head(info)
Of course, you can adjust this in several ways. The fundamental structure of this scraper is built on the rvest and polite libraries. Check out those libraries' documentation pages to see extended examples of how they're used and how you can expand your scraper further. For example, rvest includes simple functions like "html_element()" and "html_table()" that help R read HTML tables and return them as data frames.
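For instance, here's a small sketch of html_table() in action. The Wikipedia article is chosen only because it contains HTML tables; swap in whatever table-heavy page you're actually targeting.

library(rvest)

# html_table() converts HTML <table> elements into R data frames.
page <- read_html("https://en.wikipedia.org/wiki/List_of_sovereign_states")
tables <- page %>%
  html_elements("table") %>%
  html_table()
tables[[1]]  # the first table on the page, returned as a tibble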
R Web Scraping Examples
Once you’ve written a basic web scraping program, the sky’s the limit. You’re free to start collecting all the data you need for your business or your personal curiosity.
There are plenty of examples if you’re looking for ideas. A common use for web scrapers is to collect pricing information from online retailers. Businesses perform these scrapes to gather information about what products their competitors are selling and for how much. Some companies perform individual scrapes before releasing new products, while others perform regular scrapes to remain competitive with their pricing structure.
You can collect price information for your personal life, too. If you want to buy something expensive, it makes sense to find the lowest possible price. You can scrape retail sites to find their current prices on that high-end appliance you want. If you perform regular scrapes, you can stay on top of sales and get your appliance for the lowest possible price.
Other reasons to start web scraping with R include:
- Collecting hotel and travel information: If you don’t want to rely on Google Travel, you can scrape travel sites to find available hotel rooms, cheap flights, and good deals.
- Gathering stock data: If you care about stocks or cryptocurrency, you can scrape the relevant websites to get minute-to-minute price changes and updates.
- Scanning social media: Companies can scrape social media to keep an eye on customer opinions on brands. Meanwhile, individuals can scrape these sites to look for references to bands, events, and comments that interest them.
- Monitoring real estate: Buying real estate is complicated. With a web scraper, you can easily collect property descriptions, prices, and other details that will help you find the perfect place to buy.
Analyzing Data After Web Scraping with R
The biggest benefit of R is how it helps you analyze the data you’ve collected through web scraping. The R environment was designed with in-depth statistical analysis in mind. That means that you can get a lot of high-quality analysis and visualization done in just a few lines.
The easiest way to perform dataset analysis in R is through the “dplyr” library. This library makes it easy to perform analysis on even the largest datasets. Once your program has generated a table of data, the dplyr library offers single-word commands to perform many common types of analysis.
You can perform tasks like:
- Column selection: It’s easy to print just certain columns from a structured dataset. With select() you can choose the columns you want to display and ignore the rest.
- Data filtering: filter() lets you keep only the rows that satisfy a condition. For example, filter(price == 9.99) will return only the rows where the price column equals 9.99.
- Data ordering: If you want to display data in a certain order, you can use arrange(). Using arrange(price) will order the rows from lowest to highest according to the price column.
- Deriving columns: R even lets you create new columns in your dataset. The mutate() command builds a new column from the values in existing columns. For example, mutate(price_per_ounce = price / weight) divides each row's price by its weight to work out how much the item costs per ounce. A short sketch putting these commands together follows this list.
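Here's a minimal sketch of those four verbs working together on a made-up product table. The data frame and its column names are purely illustrative.

library(dplyr)

products <- tibble::tibble(
  name   = c("House Blend", "Dark Roast", "Decaf"),
  price  = c(9.99, 12.49, 9.99),
  weight = c(12, 16, 12)  # weight in ounces
)

products %>%
  select(name, price, weight) %>%            # keep only these columns
  filter(price == 9.99) %>%                  # keep rows where price is exactly 9.99
  arrange(price) %>%                         # order rows from lowest to highest price
  mutate(price_per_ounce = price / weight)   # derive a new column from two existing ones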
If you want to perform a more in-depth analysis, you can figure out just about anything you want with the right dataset. The R Journal offers nearly every kind of tutorial and example you could need for using R to analyze massive datasets.
R Web Scraping Best Practices and Tactics
There’s much more that you can do to improve your web scraper once you’ve gotten started. You can start scraping data from websites using R more efficiently than ever by implementing a few best practices. If you’re ready to improve your scraper, try these expert-level tips.
Scrape multiple URLs at once
Would you get in your car to visit the house next door? Probably not. That’s basically what you’re doing if you write a web scraper to collect information from a single web page. If you’re bothering to use a dedicated machine, you should make it worth your while and scrape multiple URLs at once.
The easiest way to scrape many pages at once is to institute loops in your code. There are multiple ways you can accomplish this.
Some programmers prefer to add the "purrr" library or the base "lapply()" function to their programs. These let you repeat the same scrape across a list of URLs. Others use the "tibble" library to create data frames and collect data from a wide range of pages. Either way, you'll create a loop where the program switches to a new URL after completing each individual scrape and repeats the process, as sketched below.
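As a rough sketch, a loop built on lapply() and polite might look like the following. The second URL is a placeholder; in practice you'd build the vector of URLs from the site's structure or a sitemap.

library(polite)
library(rvest)

urls <- c(
  "https://www.homestratosphere.com/types-of-coffee-beans/",
  "https://www.example.com/another-page/"  # placeholder
)

results <- lapply(urls, function(u) {
  bow(u) %>%          # polite enforces a pause between requests
    scrape() %>%
    html_elements("h2") %>%
    html_text2()
})

names(results) <- urls
results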
Connect with other languages
One of the fantastic benefits of R is how well it can connect with other programming languages. You can connect R with Python, C#, and many other coding languages with free packages. RStudio even has native integration to call other kinds of programming if you choose.
That can be invaluable. rvest only reads static HTML content, so if a website relies on JavaScript to load its data (which many do), you'll need help from another tool. One option is PhantomJS, a headless browser that renders the JavaScript-driven elements into plain HTML that your R-based scraper can then collect and analyze.
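A rough sketch of that workflow, assuming PhantomJS is installed and on your PATH and that you've written a small scrape.js script that opens the target URL and saves the fully rendered page as rendered.html (PhantomJS is no longer actively maintained, so any headless browser that can dump rendered HTML works the same way):

library(rvest)

# Run the headless browser first; it renders the JavaScript and writes the result to disk.
system2("phantomjs", args = "scrape.js")

# The rendered page is now an ordinary HTML file that rvest can read.
rendered <- read_html("rendered.html")
rendered %>%
  html_elements("h2") %>%
  html_text2()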
Create a human scraping pattern
Many websites have security features that work to identify and ban bots. There are various reasons for this, from protection against malicious hackers to simply avoiding server strain. One of the most common ways sites do this is by looking for robotic behavior patterns among users. This includes visiting a new page too quickly and visiting too many pages in a short time. If you want to avoid being banned, it’s worth slowing down your program so it looks a little more human.
If you're using the polite library in your scraper, you're already doing this! Polite reads the site's robots.txt file and automatically slows down the scrape to the website's preferred speed. If you're just using rvest, however, it's worth getting some help. Add polite to your program or work with a premade scraper like Scraping Bot to keep things running smoothly.
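If you do stick with plain rvest, one simple (if crude) approach is to pause a random few seconds between pages so your requests don't arrive at machine-gun speed. This is only a sketch, and the URLs and timing are placeholders:

library(rvest)

urls <- c("https://example.com/page1", "https://example.com/page2")

results <- lapply(urls, function(u) {
  Sys.sleep(runif(1, min = 3, max = 8))  # random, human-ish pause before each request
  read_html(u) %>%
    html_elements("h2") %>%
    html_text2()
})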
Prepare for forms
There are forms and dropdown menus all over the internet. With rvest, you can prepare to work with these forms neatly inside your scraper. As long as the form is built with HTML, you can scrape it with rvest’s “html_form()” function.
There are three specific form-based actions rvest supports. The html_form() function extracts the forms from whatever document or URL you give it. The html_form_set() function lets you fill in a form field, like a dropdown menu or a text box. Finally, html_form_submit() lets you submit a completed form.
For example, you could write a scraper that looks like this with just rvest:
Site <- read_html("http://www.google.com")
Find <- html_form(Site)[[1]]
Find <- Find %>% html_form_set(q = "football", hl = "fr")
This reads Google's homepage, grabs the first form on the page (html_form() returns a list of every form it finds), fills the search field (q) with the word "football," and sets the language field (hl) to French. To get the results of that search, you can add:
resp <- html_form_submit(Find)
results <- read_html(resp)
This submits the form and reads the results page that comes back into a document you can scrape like any other.
Use proxies
Every good web scraper should be used in conjunction with proxies. A proxy helps shield your IP address from bans and malicious internet users. When you use a proxy, you don’t have to worry about your personal IP address getting blocked by the websites you’re trying to scrape. Furthermore, you can make your scraping look more human and avoid getting IPs banned — period.
You can neatly add proxies into your program with the “Sys.setenv()” command. This can be as simple as the following:
Sys.setenv(https_proxy = "154.16.53.16:8080",
           http_proxy  = "154.16.53.16:8080")
This lets the system know to use this proxy when accessing the internet. Obviously, you can insert your own proxies instead!
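If you'd rather set a proxy for a single request instead of the whole session, httr's use_proxy() helper is another option. A quick sketch, reusing the same placeholder IP and port:

library(httr)
library(rvest)

resp <- GET(
  "https://www.homestratosphere.com/types-of-coffee-beans/",
  use_proxy("154.16.53.16", port = 8080)  # route just this request through the proxy
)
page <- read_html(content(resp, as = "text", encoding = "UTF-8"))
page %>%
  html_elements("h2") %>%
  html_text2()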
You can also use a program like the Proxy Pilot proxy management application. This simple-to-use program will automatically manage your proxies for you. You don’t need to change your scraper, period.
The 6 Most Common Complications when HTML Scraping with R
No web scraper is problem-free. Whether you've just written your first R web scraping program or you're running into new problems with an older bot, there's always a new, weird error to fix. However, most scraping problems can be fixed with a bit of research. Here's what you need to know to fix the six most common web scraping problems.
Authentication
Many sites require you to log in before you can access the information you need. If this is true of a site you want to scrape, you can use rvest's "html_form()" functions to log in before you start scraping: fill in the login form, then use html_form_submit() to submit it and move on to the next page. Put this step before your scraping loop so you don't log in before every page and appear suspicious.
The easiest way to find out exactly what the login form expects is to visit the site manually and use developer tools like Google Chrome's DevTools to see what the server is requesting. From there, you can neatly add those details to your bot.
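For example, a login step built on rvest's session functions might look like the sketch below. The login URL and the username and password field names are hypothetical; check the real form in your browser's developer tools and adjust to match.

library(rvest)

s    <- session("https://example.com/login")   # start a session that keeps cookies
form <- html_form(s)[[1]]                      # grab the first form on the login page
form <- html_form_set(form, username = "my_user", password = "my_password")
s    <- session_submit(s, form)                # log in; the session now carries the cookies

# Later requests made through the session are authenticated.
members <- session_jump_to(s, "https://example.com/members/data")
members %>%
  html_elements("h2") %>%
  html_text2()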
Honeypots
Honeypots are traps designed to spot and ban bots crawling a website. They’re trap links that are hidden behind invisible CSS elements on the page. The only way for a user to find these links is to inspect the HTML and click on them specifically. Bots that are designed to click every link on a page will see the link in the raw HTML and get banned. You can neatly avoid honeypots by programming your bot to ignore anything that’s behind an invisible CSS element.
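One way to do that, sketched below, is to drop any link whose inline style hides it. This only catches styles written directly on the tag; links hidden through an external stylesheet would need their class names filtered out as well.

library(rvest)

page  <- read_html("https://example.com")  # placeholder URL
links <- page %>% html_elements("a")

# Flag links hidden with inline CSS such as display:none or visibility:hidden.
styles <- html_attr(links, "style")
hidden <- !is.na(styles) & grepl("display:\\s*none|visibility:\\s*hidden", styles)

visible_urls <- html_attr(links[!hidden], "href")  # only follow the visible links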
JavaScript elements
JavaScript is a valuable tool for website designers because it allows for asynchronous loading. A site like Tumblr that provides “infinite scrolling” uses JavaScript to continuously load new information onto the page without reloading the entire site. Behind the scenes, the site makes AJAX calls that bring up this new information whenever someone scrolls to the bottom of the page. If you’re web scraping using R, that means you’re not going to get much data from the HTML on the site.
You can fix that with a few tools. The most effective is PhantomJS. Instead of just pulling down the raw HTML, the PhantomJS headless browser loads the page in a real browser engine, runs the JavaScript, and hands the resulting HTML to your program to read.
Redirects
If a page is trying to redirect your bot, you'll likely see 3XX status codes (redirects) in your results. You can follow redirects by implementing the RSelenium library. RSelenium drives a real browser through the Selenium tool, so your scraper follows any redirects the same way a normal visitor would.
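A minimal RSelenium sketch looks like the following. It assumes rsDriver() can download and start a local Selenium server for you; the URL is a placeholder.

library(RSelenium)
library(rvest)

driver <- rsDriver(browser = "firefox", verbose = FALSE)
remDr  <- driver$client

remDr$navigate("https://example.com/old-page")   # the real browser follows any redirects
page <- read_html(remDr$getPageSource()[[1]])    # hand the final, post-redirect HTML to rvest

remDr$close()
driver$server$stop()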
Robotic behavior
If your bot is too robotic, you’re going to trigger anti-crawling security measures on many sites. These programs look for users who behave too regularly, like constantly clicking in the same place or clicking too quickly. Using the polite library can help you slow down your scraping speed and seem a little less robotic to these bot finders.
Unstructured HTML
The most challenging problem any web scraper faces is simple: poorly structured HTML. A site may have unstructured HTML because it uses server-side CSS classes and attributes, or it may just be poorly designed. If you’ve run into a site that doesn’t follow a pattern, you may need to just brute-force your way through it one page at a time.
The Best Web Proxies for Web Scraping Using R
The most common problems for most web scrapers all revolve around anti-crawling technology. Luckily, that means they can all be fixed with appropriate security. By implementing the right proxies, you can make your scraper seem more human and avoid triggering bans that can ruin your data scraping in R. The question is: which proxies will keep your bot safe?
Rotating residential proxies for your scraping needs
When you boil it down, the safest, most reliable proxies are rotating residential proxies. These are IP addresses that appear to come from a person’s home. Behind the scenes, the proxy provider automatically swaps out the proxy you’re using. This way, you never use one proxy for too long.
The result is excellent for web scrapers. Since the proxies are constantly changing, sites don't have time to home in on any one IP address as acting suspiciously. Furthermore, most sites avoid banning IP addresses that appear to come from a natural person, so residential proxies are less likely to get banned in general. You can work with Rayobyte's rotating residential proxies to keep your scrapes safe from bans and other problems.
Rayobyte offers other practical proxy solutions, too. If you've connected your scraper to Python, you can quickly implement the Proxy Pilot program in your R web scraping bot. It's a free proxy management solution that handles problems like cooldown logic, proxy retries, and rotation for you. That saves you from having to write these complicated routines yourself.
If you’re using a Rayobyte residential proxy, Proxy Pilot is already built-in. That makes the proxies particularly effective for web scraping. Let Rayobyte rotating residential proxies do all the work for you.
Get Started Today
You’ve done it! You’ve learned everything you need to know about R, including how it works for web scraping, how to write a great scraper, and how to address common problems. You’re ready to make your own scraper and collect all the information you could need. The last step is to make your scraper as effective as possible by adding Rayobyte proxies to your program. You can keep your scraper and IP address safe without having to write all the code yourself.