Web Scraping with RSelenium and Rvest
Web scraping lets you pull data from websites at scale so you can use it to inform business decisions or accomplish other goals. One of the most common challenges, however, is capturing data from dynamic web pages: pages that require input or interaction, such as logins or CAPTCHAs. RSelenium is a package that contains valuable tools for navigating websites from code, and when it is paired with the rvest package, it can scrape dynamic pages, making it a critical component of your web scraping toolkit.
If you have spent some time learning about the benefits of web scraping, including how to use tools like RSelenium, you already know the value this process offers. This guide breaks down when and how to use these tools to accomplish the tasks you have in mind. What you will find is that they can make web scraping straightforward.
What Makes a Dynamic Page Unique
To understand the need for RSelenium and rvest, you must first know why dynamic pages require something more than basic web scraping tools built for static pages, such as Scrapy.
A static page is one in which the web page contains the same information for all users. While it may be updated over time, every person who visits the page sees the same content, and there is nothing extra to do to gain access to specialized content or data.
Dynamic web pages are different. They require the user to input some type of information. The page is interactive, and that means a person – or something acting like a person – must provide key information to gain access. Traditional web scraping tools are not equipped to handle this.
Using rvest alone may help you with static web pages. However, for dynamic pages, you need to pair it with a tool like RSelenium.
How to Install rvest in R
Before getting to grips with rvest, we first need to make sure we have the right tools. Here is how to install rvest in R; if you already have everything set up, you can skip ahead.
To get started, you will need to do a few things.
Install Java. If you do not have the most up-to-date version, go to java.com to download it.
Install and load the R packages you need (a quick install sketch follows this list):
- library(tidyverse)
- library(rvest)
- library(RSelenium)
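If any of these packages are missing, install them from CRAN first. A minimal one-time setup:
install.packages(c("tidyverse", "rvest", "RSelenium")) # run once, then load with library()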
Start Selenium next by calling rsDriver():
rD <- RSelenium::rsDriver()
If everything is working okay, you should see a new Chrome window open. It will look like a blank browser page. If this does not happen, you may need to update your ChromeDriver version.
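If the versions do not match, you can ask rsDriver() for a specific driver build with its chromever argument. A minimal sketch, where the version string is only an example and should be replaced with one that matches your installed Chrome:
rD <- RSelenium::rsDriver(browser = "chrome", chromever = "114.0.5735.90") # hypothetical version; match it to your Chrome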
Basic Strategies for Using Rvest for Web Scraping
There are numerous methods you need to know to get the most out of these packages. The following are some of the most important ones for web scraping. If you need more information, check out the RSelenium package documentation.
The following are important strategies for basic use:
- Navigate to a specific url: navigate(url)
- Go back (this is the same as hitting the back button on a browser): goBack()
- Go forward (this is the same as hitting the forward button on a browser): goForward()
- Retrieve the URL of the current page: getCurrentUrl()
- Reload the current page: refresh()
- Maximize the size of the browser window: maxWindowSize()
- Instantiate the browser by sending a request to the remote server (which may be necessary if the browser closes due to inactivity, for example): open(silent = FALSE)
- Get the current page source: getPageSource()[[1]]
- Close the current session: close()
So far, you have all of the basics, and they are pretty straightforward to use. Try them out to make sure you’ve got the code down well.
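As a quick check, here is a minimal sketch that strings a few of these methods together. It assumes the rD object created by rsDriver() in the installation step, whose $client field holds the browser:
remDr <- rD$client # the remote driver started by rsDriver()
remDr$navigate("http://example.com")
remDr$getCurrentUrl() # confirm where you landed
remDr$maxWindowSize()
src <- remDr$getPageSource()[[1]] # raw HTML of the rendered page
remDr$close()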
Next, we need to work with elements. This is what is going to help you dig in and get the information you need. These are, again, the basic starting points. You can find additional details in the package documentation mentioned earlier.
Search for a page element:
findElement(using, value)
This will search for an element on a page starting with the document root and will return it as an object of the webElement class. If you need help using this tool, consider SelectorGadget, a Chrome extension that can provide more guidance on the process of locating elements.
Highlight the current element:
highlightElement()
This piece of code will help you to check that you selected the element you wanted to select.
Clear a TEXTAREA or text INPUT element’s value:
clearElement()
Click the element:
clickElement()
This method allows you to click on links, check boxes, or use dropdown lists.
Send a sequence of keystrokes to an element:
sendKeysToElement()
Keystrokes are sent as a list. Plain text goes in unnamed elements of the list; special keys come from the selKeys list and are passed with the name "key".
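Putting a few of these together, here is a sketch that fills in a search box and submits it. The CSS selector is a placeholder you would swap for one that exists on your target site:
remDr <- rD$client
searchBox <- remDr$findElement(using = "css selector", value = "input[name='q']") # hypothetical selector
searchBox$highlightElement() # confirm you grabbed the right element
searchBox$sendKeysToElement(list("web scraping")) # plain text goes in unnamed list elements
searchBox$sendKeysToElement(list(key = "enter")) # special keys come from selKeys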
Using these tools and rvest in R, you can accomplish a variety of tasks. You can also find more of the available methods in the package documentation.
How to Select Nodes
Using CSS selectors to find nodes is an option for many projects. These are the calls you need to know to do that:
- html_nodes(page, "div") # all div elements
- html_nodes(page, "div#intro") # div with id intro
- html_nodes(page, "div.results") # div with class results
- html_nodes(page, "div#intro.featured") # div with both id and class
- html_nodes(page, "div > p") # p directly inside div
- html_nodes(page, "ul > li:nth-child(2)") # second li in ul
To extract information from the selected nodes, you pull out their text, HTML, or attributes. That includes the following:
- text <- html_text(nodes) # text content
- children <- html_children(nodes) # child nodes
- hrefs <- html_attr(links, "href") # href attribute
- imgs <- html_attr(img_nodes, "src") # src attribute
If you are using this process to extract information from a table found on a website, you will need to use this code:
tables <- html_table(html_nodes(page, "table"))
df <- tables[[1]] # extract as dataframe
You can also use XPath selectors in situations where the queries are more complex. To do that, use this code:
html_nodes(page, xpath = '//*[@id="intro"]/p') # xpath selector
html_text(html_nodes(page, xpath = '//p[@class="summary"]')) # xpath and extract text
How to Parse and Navigate
The next step is to learn how to parse the document structure to find the information you need. Use the following code to help you start the parsing process:
url <- "http://example.com"
page <- read_html(url)
title <- html_text(html_nodes(page, "title"))
h1 <- html_text(html_nodes(page, "h1"))
links <- html_nodes(page, "a") # all links
Next, you will need to navigate from one page to the next. Links come back as nodes, so pull the URL out of the node first:
other_url <- html_attr(links[[12]], "href") # the href of, say, the twelfth link
other_page <- read_html(other_url)
# submitting forms is handled through a session; see the login example further down
Troubleshooting the process is often necessary (there is a lot of code to know to keep web scraping manageable). Here are a few key components you may need.
If you need to set a user agent to avoid basic bot blocking, fetch the page through httr and parse the response with read_html():
library(httr) # supplies GET() and user_agent()
resp <- GET("http://example.com", user_agent("Mozilla/5.0"))
page <- read_html(content(resp, as = "text", encoding = "UTF-8"))
If the HTML is malformed or the encoding is off, the following can help:
page <- read_html("http://example.com", encoding = "UTF-8") # declare the encoding explicitly
html_name(nodes) # inspect tag names to check what actually parsed
text <- repair_encoding(html_text(nodes), from = "ISO-8859-1") # re-encode text read with the wrong charset (older rvest versions)
These tools are more robust than just the calls covered so far. If you are considering more advanced steps and want a few of the most commonly used patterns, consider the following.
If you want full browser interaction using RSelenium, which is common when a page depends on JavaScript, you will need to use the following code:
remote_driver <- rsDriver(browser = "firefox")
client <- remote_driver$client # rsDriver() returns both a server and a client
client$navigate("http://example.com")
page <- read_html(client$getPageSource()[[1]]) # hand the rendered HTML to rvest
title <- html_text(html_nodes(page, "title"))
client$close()
remote_driver$server$stop() # shut down the Selenium server as well
There are also situations where you will need to get past a login (which is one of the limitations of other web scraping tools). The form and field names below are placeholders, so adjust them to match the site you are working with:
session <- html_session("http://example.com") # pre-1.0 rvest; newer versions call this session()
login_page <- jump_to(session, "login")
form <- html_form(login_page)[[1]] # grab the first form on the login page
filled <- set_values(form, username = "user123", password = "secret") # field names depend on the site
logged_in <- submit_form(login_page, filled)
account <- jump_to(logged_in, "account")
Now, one of the most important steps for many projects is being able to scrape JavaScript-generated content. rvest on its own cannot execute JavaScript, which is exactly why pairing it with RSelenium is so valuable for these projects: let the browser render the page, then hand the result to rvest. The key is the following code:
rsd <- rsDriver(browser = "chrome")
client <- rsd$client # the browser that RSelenium controls
client$navigate("http://example.com")
page <- read_html(client$getPageSource()[[1]]) # HTML after the JavaScript has run
html <- html_text(page)
To extract element names, consider the following code:
names <- html_name(nodes)
Extract child nodes using the following:
children <- html_children(node)
Extract sibling nodes using xml_siblings() from the xml2 package (which rvest is built on):
siblings <- xml_siblings(node)
Those are more of the calls and details you need for highly effective web scraping with rvest. However, the only way to know whether you have mastered them is to try them out!
What Is the Difference Between Rvest and RSelenium?
One of the questions you may have is, what is the difference between these tools? Let’s break this down.
First, rvest is a tool that reads an HTML page and then extracts elements from it. You can work with forms on the page or even submit a password to gain access to content. What makes rvest so useful is that it is lightweight and efficient; it is a robust tool for pulling the details you need out of web pages.
Second, RSelenium is different for several reasons. First, rvest cannot execute JavaScript, which creates a problem for any website built around it (and that is a lot of websites, of course). RSelenium gives you a web browser that you control from code, meaning you can do everything necessary on the page using just code. It was originally designed to facilitate automated testing of websites, including page loading and other behavior.
In many cases, you can just use rvest. However, you will need RSelenium in situations where you need to interact with JavaScript-driven pages. No matter what the best web scraping tools are for your needs, it helps to have a good understanding of these two.
How to Use Rvest and How to Use RSelenium
Let’s provide some basics on how to use rvest since it is such an important component for web scraping online. We can also offer some best practices for using RSelenium to achieve your goals. Keep the following in mind:
When it comes to web scraping, remember to only use these or other methods in an ethical manner. You do not want to engage in activities that put you or your company at risk.
- Avoid scraping data that is behind a login wall, or that requires payment. Unless you have authorized permission to do so, it is a good idea not to capture this data. You could be held accountable for that action.
- It is easy to overload servers with the volume of requests these tools (rvest and RSelenium) can generate. Avoid sending too many requests at one time; see the pacing sketch after this list.
- Before you get started, learn the rules. Most websites will have some information about web scraping and when you can and cannot engage in those actions.
- Always protect your privacy throughout this process. If you have not done so, read our comprehensive guide to proxy meaning and when to use it. Proxies can help to provide you with a layer of privacy while engaging in these activities.
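As a simple way to pace requests, here is a sketch that pauses between pages. The URLs are placeholders for the pages you actually plan to scrape:
urls <- c("http://example.com/page1", "http://example.com/page2") # hypothetical URLs
pages <- list()
for (u in urls) {
  pages[[u]] <- read_html(u)
  Sys.sleep(2) # wait a couple of seconds between requests so you do not hammer the server
}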
If you run into problems with HTTP errors, including 503 and 404 status codes, the following code can help you troubleshoot the process:
page <- tryCatch(
  read_html("http://example.com"),
  error = function(e) e # return the error object instead of stopping
)
if (inherits(page, "error")) {
  # handle the error (retry, log, or skip the page)
} else {
  # scrape the page
}
Here is an example of how to interact with dynamic pages. Check out these RSelenium code examples:
library(RSelenium)
driver <- rsDriver(browser = "chrome")
client <- driver$client # rsDriver() returns a server and a client
client$navigate("http://example.com")
page <- read_html(client$getPageSource()[[1]]) # parse the rendered source with rvest
html_text(html_nodes(page, "#dynamic-content"))
client$close()
driver$server$stop()
If you plan to parse XML, you can do so alongside rvest using the xml2 package (which rvest is built on). To do that, use the following code:
library(xml2)
xml <- read_xml("data.xml") # use read_xml() rather than read_html() for XML documents
nodes <- xml %>% xml_find_all("//item")
How Rayobyte Can Help You Navigate Web Scraping with Ease
RSelenium and rvest are two of the most effective tools for scraping data from dynamic pages. At Rayobyte, we can offer you another layer of help by providing resources for using proxies as a component of the process. With so much at risk and so much value present when it comes to web scraping, you can take full advantage of the process using Rayobyte's proxy services.
To learn more about Rayobyte, check out our services now or give us a call. Contact us now for more information.
The information contained within this article, including information posted by official staff, guest-submitted material, message board postings, or other third-party material is presented solely for the purposes of education and furtherance of the knowledge of the reader. All trademarks used in this publication are hereby acknowledged as the property of their respective owners.