Rust Web Scraping: How and Why You Should Consider It
By now, you’ve probably heard of the merits of scraping data from the web to provide valuable insights into your competitors and your industry at large. (Web scraping, as a reminder, is the typically automated trawling of publicly available data on the internet, which you can then analyze for strategically valuable patterns and information.)
While you could always scrape the web manually, automation makes it possible to collect the reams of data you need to make it a going concern, using preexisting libraries and scripts (sometimes modified for specific ends). Once you collect the data, there are different ways to parse the HTML and search through it with software or spreadsheets.
Several languages are popular with developers, each having its own strengths and weaknesses. But over the last few years, some developers have fallen in love with a language called Rust, even getting somewhat evangelical about it. Indeed, Stack Overflow’s 2021 survey of 80,000 developers dubbed it the “most loved” programming language. Granted, it doesn’t have the same number of users as, say, Python; but even so, what explains this passion for Rust?
Competitors to Rust
The passion for Rust, which was used to help develop Firefox, may have to do with the shortfalls of other languages. Different scrapers will have their own preferences when it comes to these based on a mix of usability and functionality. Let’s consider some of the other popular choices for web scraping languages:
- Python, currently the world’s most popular programming language, is a fast, efficient, open-source language for data science projects and web scraping. Python also gives you access to a huge ecosystem of third-party libraries, such as Requests and Beautiful Soup, which make it quicker to fetch and parse the HTML of websites. Python can struggle, though, to keep up with the changing nature of the web, and you may have to keep adapting your scraping techniques. It’s also not as scalable as some other options, because its Global Interpreter Lock limits true parallelism.
- C++, which is considered one of the more versatile and powerful languages. It’s faster than Python because it’s statically typed and compiled to native machine code, so the programs it produces run more quickly. While C++ has significant library support, it’s also resource-intensive and can be hard to learn.
- Java, an object-oriented programming language originally developed by Sun Microsystems (now part of Oracle), which is free for personal use. The downsides to Java are that its code tends to be verbose and time-consuming to write, it can be slow, and it soaks up significant amounts of memory. Java is still popular, but its popularity seems to be on the wane compared to Python and other simpler options.
- PowerShell, developed by Microsoft, which offers benefits like easy automation. While it works on any website, PowerShell makes it especially easy to access data on apps and sites built on Microsoft’s .NET platform. However, PowerShell has some security weaknesses and can be open to hacking or other attacks. Its pipeline is also object-based, rather than text-based, which many scrapers and programmers find a far less intuitive way to work.
Other coding languages include Erlang, Go, and Bash.
Of these languages, C++ (an ancestor of Rust) probably comes closest to Rust in usability and power, and C++ web scraping is still popular. But “Rustaceans” will tell you that Rust is faster and has better memory safety than C++.
Rust’s memory safety comes from the fact that, unlike C++, it does not allow dangling or null pointers (references to memory that has been freed, or that never pointed to a valid object in the first place), which have caused innumerable crashes and system errors, and a lot of frustration for computer users. C++ is a venerable but older language (it first appeared in 1985), and even though it has received numerous patches and fixes, those are not quite the same as having no major security flaws or bugs to fix in the first place.
So, Rust’s relative newness, and its origins as a language designed to avoid the pitfalls of its forebears, work very much in its favor as a language geared toward the problems scrapers and developers face today.
So, What Is Rust?
Software developer Graydon Hoare began work on Rust at Mozilla in 2006. Since then, having gone through several iterations, Rust has achieved a lot of popularity in the world of software engineering. Companies like Amazon, Discord, Meta, and Microsoft all use Rust. Seeing as Microsoft developed PowerShell, this is quite a compliment.
Rust builds on several languages that came before it, drawing influences from the likes of C++ and Erlang. It aims to be rock solid and free of the bugs and hiccups that plague many computing languages, the kind that can easily hold up the important progress of data scraping when you’re looking for pressing answers to your research questions. In the words of its own materials, Rust is a “language empowering everyone to build reliable and efficient software.” It touts:
- Fast, memory-efficient performance;
- Reliability;
- Productivity credentials, including its comprehensive documentation, auto-formatting ability, and integrated package manager, Cargo, which “downloads your Rust package’s dependencies, compiles your packages, makes distributable packages, and uploads them to crates.io, the Rust community’s package registry.” (Dependencies are the libraries you’re relying on when you work in Rust; it’s a great step forward that Cargo manages these automatically rather than making you deal with them manually.)
So why is Rust necessary? Its developers believe that it represents a step forward in reliability and efficiency. Rust also markets itself as an actively user-friendly language, with a book and a handy online quiz to help people learn the language. The affection users show for it may lie in the way it solves issues presented by other languages. (There isn’t a Rust web scraping book exactly, but there is a lot of information you can use in the manual and Rust’s libraries).
Pros of Using Rust
Rust is memory safe
“Memory safety” in computing refers to a language’s protection against memory-related bugs, such as buffer overflows and use-after-free errors, that can crash programs or open security holes. This matters when you’re scraping the web, as you don’t want to expose your own IP address and data to any sites you may be scraping, and you certainly don’t want to compromise your anonymity. C++ is a powerful language, but its memory safety is a cause of concern for some. Rust is also intelligent in the way it uses memory, giving you options as to how you want to allocate and deallocate memory.
Another way of securing your anonymity and privacy is to use proxies when web scraping, which Rayobyte certainly recommends. Proxies are middlemen that sit between your IP address and the server you’re targeting with your requests.
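To give a flavor of how this looks in practice, here’s a minimal sketch that routes requests through a proxy using the reqwest crate (which you’ll meet again below). The proxy endpoint and credentials are placeholders for whatever your provider gives you:

use reqwest::blocking::Client;

fn main() -> Result<(), reqwest::Error> {
    // The proxy URL is a placeholder; substitute your provider's
    // endpoint and credentials.
    let proxy = reqwest::Proxy::all("http://user:pass@proxy.example.com:8080")?;
    let client = Client::builder().proxy(proxy).build()?;

    // Every request made through this client is routed via the proxy.
    let body = client.get("https://ecommercesite.com").send()?.text()?;
    println!("fetched {} bytes through the proxy", body.len());
    Ok(())
}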
Rust supports concurrent programming
Concurrent programming allows multiple processes to be executed simultaneously rather than sequentially. This has obvious benefits for speed and efficiency, both of which are crucial for web scraping. Python, for instance, though it provides numerous benefits, has no built-in mechanism for true parallelism. Rust is very scalable and can handle very large amounts of data without losing performance.
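As a rough illustration of what that buys you, here’s a minimal sketch that downloads several pages in parallel using standard library threads and reqwest’s blocking API. The URLs are placeholders, and production scrapers often reach for an async runtime like tokio instead, but the principle is the same:

use std::thread;

fn main() {
    // Hypothetical product pages; swap in the URLs you actually need.
    let urls = [
        "https://ecommercesite.com/page/1",
        "https://ecommercesite.com/page/2",
        "https://ecommercesite.com/page/3",
    ];

    // Spawn one thread per URL so the downloads run in parallel.
    let handles: Vec<_> = urls
        .iter()
        .map(|&url| {
            thread::spawn(move || {
                let body = reqwest::blocking::get(url).unwrap().text().unwrap();
                (url, body.len())
            })
        })
        .collect();

    // Wait for every thread and report what it fetched.
    for handle in handles {
        let (url, bytes) = handle.join().unwrap();
        println!("{}: {} bytes", url, bytes);
    }
}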
Rust offers great support and a solid community
The Rust-verse includes the aforementioned tutorials, in addition to various subreddits and forums in which “Rustaceans” swap tips.
You may need to master some lingo, but Rustaceans are very happy to teach you. Crates.io is the Rust community’s crate registry, which allows Rust users both to upload their own “crates” and to make use of the preexisting libraries others have uploaded. If you’re wondering what crates are, they’re akin to what other languages call packages; Rust defines them as units of “compilation and linkage.” (A full list of Rust terminology can be found here.) Crates.io boasts 23,428,303,777 downloads, which is a good indicator of the community’s passion. There are a couple of specific crates you’ll need to run a web scraper, which you’ll see more about below.
Beyond that, the community established around the Rust language has an impressive rapport. Many heavy hitters from the tech world are involved in the Rust Foundation, which is “an independent non-profit organization to steward the Rust programming language and ecosystem, with a unique focus on supporting the set of maintainers that govern and develop the project.” This speaks highly of its cross-industry support. Plus, Rust looks set to appear in the Linux kernel in 2023, an indication that it’s made it as a programming language.
The Rust community is highly devoted to its evolution and sustainability and committed to making sure that it’s as future-proofed as possible.
Cons of Using Rust
With apologies to the lovers of Rust, the language does have some drawbacks.
Rust can deliver errors
Since Rust is so diligent and thorough, it can be prone to returning error messages. On the upside, these error messages are designed to be in relatively clear language, and users should be able to decipher and address them without much trouble.
Rust is still developing
Since Rust is still evolving, with the help of its passionate user base, individuals can find it frustrating as commands change or different shortcuts emerge. Rust runs on a six-week rapid release cycle, which some see as an impediment to stability. Still, most Rust users enjoy interacting with the community and the sense of improving Rust together, so this is not always seen as a problem.
Rust is not widely used
Using languages like PowerShell or Python can be helpful, since many sites were built with them, and scraping is, in its essence, a kind of reverse engineering. By contrast, while beloved, Rust is nowhere near as widely used as those languages.
Rust and Web Scraping: Important Definitions
If you want to create a Rust scraper, the first thing you need to do is install Rust by following the prompts at rustup. Installation is slightly easier on Mac and Linux than on Windows. After that, the best place to start is the aforementioned crates.io, which has all the packages you need to get started on your web scraping (or any other Rust-based) endeavors.
But before that, it’s a good idea to have some definitions clear in mind. That way, you’ll know what you’re doing when you type all that code. Here’s a refresher on the structure of the websites you’ll be scraping.
Websites are made up of a mix of HTML (HyperText Markup Language) and Cascading Style Sheets (CSS). HTML is the markup language that structures the content of a site; CSS makes a site presentable, easier to read, and pretty.
HyperText Markup Language (HTML)
HTML is the language that conveys the structure of websites. On its own, it represents a site’s essential content, but it would look like very simple text and not be much fun to browse. The content that HTML contains is what you are actually looking for when you scrape.
Cascading Style Sheets (CSS)
CSS comprises the rules that craft the layout of web pages. These cover all of a site’s style elements: font, color, graphics, video, animation, and so on.
When you’re looking to break down any site for its data, it will save you a lot of time to strip away anything extraneous, like graphics, which likely won’t add anything to the information you’re looking to mine, so that you can find the facts, figures, and numbers you’re after. Doing this will also help you see how a site is structured, as a kind of branching tree.
CSS Selectors
CSS selectors, which Mozilla unpacks with impressive depth, allow you to cluster elements together when you’re designing (or exploring the design of) a website. In Mozilla’s words, these “define the pattern to select elements to which a set of CSS rules are then applied.” They can be grouped into patterns such as universal selectors, which select all the elements on a page, and attribute selectors, which select all the elements with a particular attribute, like audio. You can also compound selectors to combine various kinds of attributes in one search.
CSS selectors allow you to select any node in the branching tree structure that makes up a website’s HTML code. Finding the right selector lets you prune the right branch, as it were.
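To make this concrete, here’s a small sketch using the scraper crate (introduced properly below) against an invented HTML fragment, showing a universal selector and an attribute selector side by side:

use scraper::{Html, Selector};

fn main() {
    // A tiny stand-in for a real page, just to show the selector types.
    let html = r#"
        <div class="product">
            <span class="price" data-currency="USD">49.99</span>
            <img src="sneaker.jpg">
        </div>
    "#;
    let document = Html::parse_fragment(html);

    // Universal selector: matches every element.
    let all = Selector::parse("*").unwrap();
    println!("{} elements in total", document.select(&all).count());

    // Attribute selector: matches only elements with a data-currency attribute.
    let priced = Selector::parse("[data-currency]").unwrap();
    for element in document.select(&priced) {
        let text: String = element.text().collect();
        println!("price found: {}", text);
    }
}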
Scraping an e-Commerce Site With Rust
Common targets of web scrapers are e-commerce sites, whether B2C or B2B. This makes sense when you think about how much shopping is done online now, thanks to the pandemic accelerating broader trends. E-retail sales surpassed $5.2 trillion in 2021, and those numbers look set to climb in the coming years. That means there’s a lot of hard data to be found online.
Your reasons for scraping that data could be personal: you could scrape, for example, to track the price or availability of a pair of sneakers on different sites to see which has the best offering. But it’s more likely you’ll be scraping for business-oriented reasons. Among the many good business reasons to scrape an e-commerce site:
- Checking competitors’ offerings — price scraping means making sure you are aware of competitors’ pricing trends and special offers, making sure that your business is offering value, too. It doesn’t mean you have to undercut them, as you could decide to offer premium pricing. Rather, it allows you to strategize and find your place in the market in an informed manner.
- Aggregating content — pulling information and deals from other sites to feature on your own if you’re looking to run a marketplace or comparison site.
- Search Engine Optimization (SEO) — Monitoring keywords on competitors’ sites is crucial for SEO. Doing so means you can make sure your content is getting the same number of clicks as your competition, if not more. E-commerce SEO is all about increasing awareness of your brand and reaching a wider audience.
- Brand monitoring — If you’re selling goods or services on large e-commerce sites, you can see what customers are saying about you and get honest feedback about how you’re doing. Scraped data from e-commerce sites, which you can then subject to sentiment analysis, gives you far more actionable information than polls and surveys, in the eyes of many marketers, as it’s candid and unfiltered. You can also gauge your unaided awareness levels, that is, how well people know your brand without your having to prompt them as to its existence.
Whichever your motivation, by looking at a few different ways to approach scraping a prototypical e-commerce site with Rust, you should be able to get a sense of how well-suited it is to the task and if it’s the right language for you to use in scraping.
Before you scrape, arm yourself with information
Remember that a Rust web scraper (or indeed, any scraper) is a tool at your disposal.
You should give yourself a head start by looking closely at the site you’re going to scrape. Knowing its elements and layout will make it that much easier to strategize scraping its data. The reason is simple: when you’re scraping, you’re looking at reverse-engineering the design and layout of a site, generally expressible as a branching tree diagram, while stripping away all the stuff you don’t need. Your search will be all the more focused and Rust more effective if you set good parameters and know what you’re looking for.
Build a Scraper With Rust in Three(ish) Basic Steps
Let’s say our website is called “ecommercesite.com.” There are many ways you could scrape it, but here’s one simple method, using Rust and its package manager Cargo.
The crate to start off with is reqwest, which collects the data; after that, you’ll use scraper to analyze it.
1. Create a new project
Run:
cargo new web_scraper
Then open your Cargo.toml file in order to add reqwest and scraper to your dependencies:
[dependencies]
reqwest = { version = "0.11", features = ["blocking"] }
scraper = "0.13.0"
In the dependencies above, you’ll note the word “blocking.” This feature enables reqwest’s synchronous API: each request blocks until the full response has arrived, so you can gather the data you need without setting up an async runtime.
2. Get the site’s HTML
To fetch all of the HTML on a webpage, put the following in your main function:
fn main() {
    let response = reqwest::blocking::get("https://ecommercesite.com")
        .unwrap()
        .text()
        .unwrap();
}
3. Extract the specific information
Using the scraper crate, add:
let document = scraper::Html::parse_document(&response);
You should now be able to match your page of HTML text against the actual website page. So if you’re looking for how sneaker price data is coded in its HTML, look at a couple of sneaker prices on the “real” web page and compare them to the coded versions. They should have markup in common, such as a shared tag or class attribute (something like <span class="price">) wrapping each price.
What you’re doing with scraper is using CSS selectors to highlight the information relevant to your search.
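Putting the pieces together, here’s a minimal sketch of steps 2 and 3 in one program. The “.product-price” selector is a hypothetical example; you’d replace it with whatever selector your inspection of the real page turns up:

use scraper::{Html, Selector};

fn main() {
    // Fetch the page synchronously (requires reqwest's "blocking" feature).
    let response = reqwest::blocking::get("https://ecommercesite.com")
        .unwrap()
        .text()
        .unwrap();
    let document = Html::parse_document(&response);

    // ".product-price" is a hypothetical class name; inspect the real
    // site's HTML to find the selector its price elements actually use.
    let selector = Selector::parse(".product-price").unwrap();
    for element in document.select(&selector) {
        let price: String = element.text().collect();
        println!("{}", price.trim());
    }
}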
Scraper, like many crates in Rust, provides other functionality, too. Its documentation covers the following (a short sketch of these operations follows the list):
- How to serialize HTML and inner HTML, allowing you to turn data into a stream of bytes that you can then store or transfer (many scrapers use JSON, but there are other options, too)
- How to access descendant text (that is, to select elements located beneath other elements, according to the tree structure of the webpage)
- Parsing selectors and fragments of HTML
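Here’s a brief sketch of what those operations look like on scraper’s ElementRef type, continuing from the snippet above (the selector is still a hypothetical example):

// `document` holds the parsed page from the earlier snippet.
let selector = scraper::Selector::parse(".product-price").unwrap();
if let Some(element) = document.select(&selector).next() {
    // Serialize the element, including its own opening and closing tags.
    println!("outer: {}", element.html());
    // Serialize only the markup nested inside the element.
    println!("inner: {}", element.inner_html());
    // Gather the descendant text nodes, with all markup stripped away.
    let text: String = element.text().collect();
    println!("text: {}", text);
}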
Once you start, it’s easy to get more sophisticated in your scraping, but that’s it in a nutshell. There are other ways of approaching building a scraper in Rust (which you can explore below) but the above three steps are the major ones.
Other Crates for Web Scraping With Rust
Part of Rust’s appeal lies in its use of crates and supportive users. You don’t need to understand every nuance of the language to get started, and “Rustaceans” would certainly encourage you to begin after reading the Rust book and then deepen your knowledge as you go.
1. Recursive_Scraper v0.51
Steven Hé (Sichang)’s recursive scraper offers you a good scaffolding on which to base your efforts.
Recursive here essentially means that when you plug in a site, the scraper will explore the site itself and all the links branching from it. This gives you a complete survey of all the information on a site and any connected ones, which can be handy if the e-commerce site you’re targeting is a clearinghouse with links to other vendors. While it’s good to know a site’s structure so that you can determine whether every page has the same layout, a recursive scraper provides peace of mind as you’re exploring a site’s nooks and crannies.
Web scraping only works if you have a steady stream of data, and it all falls apart if your data is inconsistently sourced. This particular scraping tool guarantees a constant frequency of requests, with the delay between them specified in milliseconds, assuring you of reliable, high-quality data.
This recursive scraper template also allows you to adjust all of its variables: for example, you can set the timeout for failed connections, in milliseconds, if you wish. By default, it’s 10 seconds.
If you are using proxies — there are many reasons why you should — you may want a short timeout after each request is made. If there’s a block on your IP address or some other technical glitch, you can easily switch to a different IP address to make the request. (Ideally, you’re also using software like Rayobyte’s Proxy Pilot that can make a huge difference to the efficiency of your proxies and make rapid automated judgments as to whether you’ve been blocked, if it’s just some internet outage, or if it’s a technical error that’s holding you up).
This recursive scraper also makes it easy to filter your search by regex. Regex is short for “regular expression,” a sequence of characters that specifies a search pattern in text: in other words, the kind of recurring phrase on a website that clues you in to helpful information. So if you were looking for recommended retail prices, you might search for “RRP” or some other linguistic signifier, like “price” or “$”. If you’re looking for items on sale, you could search for terms like “sale” or “special.”
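If you want to pull those patterns out of scraped text in your own code, the widely used regex crate (regex = "1" in your Cargo.toml) does the job. Here’s a minimal sketch; the sample text and the price pattern are invented for illustration:

use regex::Regex;

fn main() {
    // Sample text as it might come back from a scrape; the pattern below
    // is one plausible way to match dollar prices, not the only one.
    let text = "Runner X RRP $120.00, now on sale: $89.99";
    let price_pattern = Regex::new(r"\$\d+(?:\.\d{2})?").unwrap();

    for price in price_pattern.find_iter(text) {
        println!("found price: {}", price.as_str());
    }
}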
So to start off your recursive scrape of our dummy site ecommercesite.com, you would start by typing in:
recursive_scraper -f "https://ecommercesite.com/.*" https://ecommercesite.com/
This would give you all the information that you could then analyze in a spreadsheet or something similar.
To filter out images and get a more basic listing of text-based information, you would type in:
recursive_scraper -f "https://ecommercesite.com/.*" -s https://ecommercesite.com/
That would then give you a readout of information you can filter further, using variations on the command above. You can play around to home in on whatever elements you need.
2. url-scraper
If the above example seems complicated, Rust’s library also offers simple, single-purpose scrapers, like this url-scraper. URL scraping can be handy when searching an e-commerce site if you’re looking to aggregate content (like sellers on a larger marketplace-type site) or find out the names of wholesalers and suppliers, to name two possibilities. Here’s the script to run:
extern crate url_scraper;

use url_scraper::UrlScraper;

fn main() {
    let directory = "http://ecommercesite.com/";
    let scraper = UrlScraper::new(directory).unwrap();
    for (text, url) in scraper.into_iter() {
        println!("{}: {}", text, url);
    }
}
3. easy-scraper
This easy-scraper looks for matching patterns in the HTML DOM (Document Object Model) tree, and its page provides scripts to run. Looking for patterns (prices for the same kind of sneaker, for example) makes it easy to spot variations if you’re running price comparisons. A short sketch of its pattern syntax follows.
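Here’s a minimal sketch based on the crate’s documented pattern syntax; the product markup and class names are invented for illustration. The {{name}} placeholders capture whatever text sits at that position in the DOM:

use easy_scraper::Pattern;

fn main() {
    // The pattern describes the subtree shape to look for; {{...}}
    // placeholders capture the text found at that position.
    let pattern = Pattern::new(r#"
        <li class="product">
            <span class="name">{{name}}</span>
            <span class="price">{{price}}</span>
        </li>
    "#).unwrap();

    // Stand-in HTML; in practice this would come from reqwest.
    let html = r#"
        <ul>
            <li class="product"><span class="name">Runner X</span><span class="price">$89.99</span></li>
            <li class="product"><span class="name">Runner Y</span><span class="price">$120.00</span></li>
        </ul>
    "#;

    for m in pattern.matches(html) {
        println!("{} costs {}", m["name"], m["price"]);
    }
}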
Those are just three of the scraping tools available on Crates.io. It’s worth exploring that site in greater depth if you have specific needs, with scrapers designed for other uses like gathering social media data.
Additionally, Select.rs is a Rust library designed to help users extract data from HTML documents.
Likely Web Scraping Complications Using Rust
The complications you may face with Rust are the same ones you’d face with any scraper, aside from mastering the language itself, which can get very specific and voluminous (though you may enjoy making your way through the Rust book and online exercises).
Crates.io has many helpful crates for common issues that arise when scraping data. One issue you may face is that scrapers can be baffled when information is laid out across more than one page. There’s a user-created pagination crate that makes it much easier for you to “chunk” data and query sites that spread their information over multiple pages; the sketch below shows the underlying idea.
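Without assuming anything about that crate’s API, here’s a minimal sketch of the underlying idea, walking numbered pages with reqwest’s blocking client. The “?page=N” query scheme is a stand-in for however the target site actually paginates:

use std::{thread, time::Duration};

fn main() {
    // "?page=N" is a common but hypothetical URL scheme; check how the
    // target site actually paginates before relying on it.
    for page in 1..=5 {
        let url = format!("https://ecommercesite.com/products?page={}", page);
        let body = reqwest::blocking::get(&url).unwrap().text().unwrap();
        println!("page {}: {} bytes", page, body.len());

        // Pause between requests rather than hammering the site.
        thread::sleep(Duration::from_millis(500));
    }
}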
Three of the most common issues any web scrapers face are:
- CAPTCHA walls, which require you to answer those “are you a human?” challenges. Since your automated scraper is a bot, you may run afoul of these, resulting in timeouts. CAPTCHA breakers can be bought, and you can check personally to see whether there are CAPTCHAs on your targeted sites. The best way to get around them is by using rotating residential proxies, meaning your requests keep coming from different IP addresses belonging to actual people, which makes them harder to deny as bot requests.
- Honeypot traps, which detect requests coming from bots and send them to decoy pages without data. You can avoid these by adding code to your scraper so that it ignores links hidden by non-visible CSS elements.
- Bad or changeable code, which can frustrate your scraper’s attempts to access data, since most scraping is based on pattern recognition. Where there’s no discernible pattern, or the markup is messy, your bot can be blocked. This may be fatal, or you may be able to add some code to your scraper to recognize certain kinds of bad HTML and work around them. Additionally, as there are multiple coding languages available to designers, some scrapers will find themselves unable to access these pages without some alterations. Since Rust draws heavily on the popular C++, it copes better than most.
Other issues, like redirects, you may have to check out manually to find out why the redirect is happening (and where it leads). Once you have, you can add the URL information to your scrapers so they can navigate these redirects; a sketch of how to inspect them follows.
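One way to do that inspection from Rust, assuming you’re already using reqwest, is to disable automatic redirect-following and read the Location header yourself. The “/old-page” URL is a placeholder:

use reqwest::blocking::Client;
use reqwest::redirect::Policy;

fn main() -> Result<(), reqwest::Error> {
    // Turn off automatic redirect-following so you can see where the
    // site is trying to send you.
    let client = Client::builder().redirect(Policy::none()).build()?;

    // "/old-page" stands in for a URL you've noticed redirecting.
    let response = client.get("https://ecommercesite.com/old-page").send()?;
    if response.status().is_redirection() {
        if let Some(location) = response.headers().get(reqwest::header::LOCATION) {
            println!("redirects to: {:?}", location);
        }
    }
    Ok(())
}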
The major issue with any scraping is that you may be flagged and banned for being a bot. The best solution to this is to use rotating residential proxies. Residential proxies are far less likely to be flagged as suspicious by an overzealous security system.
Rotating residential proxies offer superior coverage to data center proxies — and are definitely much better than shared proxies, which are unsafe and slow. Additionally, you could find yourself banned from some sites because you have been sharing your proxy with a bad actor. Rayobyte insists on vetting all of its users to make sure they are upholding its ethical standards.
Rust, Web Scraping and Proxies: Other Considerations
When you scrape the web for data, you need uninterrupted speed and connectivity. If you get blocked by the e-commerce sites you’re looking to scrape data from, your data will suffer in terms of its reliability and volume, leaving you with inaccurate and incomplete data (which could be worse than no data at all). That means using rotating residential proxies.
But that’s not where your considerations end. If you’re looking to make data collection a mainstay of your business (and you should), you will need proxies that are not only reliable, but also:
Geographically dispersed
You may need data for purposes of comparison from websites in different parts of the world, or you may not want your competitors to know your requests are coming from your geographic location. In either case, IP addresses that come from all over the world recreate a more “human” pattern of data queries, while also protecting your privacy and anonymity.
Ethical
While web scraping itself relies on publicly available data and has become an agreed-upon crucial element in most businesses’ pricing strategies, there are ethical issues surrounding the use of proxies. Namely, there are companies that do not tell IP address owners that they are using their addresses. That is, in Rayobyte’s view, theft. Rayobyte’s rotating residential proxies compensate their owners fairly and get their upfront consent in clear language that’s regularly renewed. It occasionally buys proxies from other merchants who uphold the same standards.
Supported
Scraping is hard enough without running into connection difficulties, bans, and blocks.
Rayobyte’s Proxy Pilot service monitors all your traffic to work out crucial details and tasks like:
- What’s a block and what’s just a technical difficulty
- Your rate of success and failure in making IP requests
- An audit of the sites you’re scraping so that you can maintain your course or chart a different one
- Making sure to “cool down” your proxies between requests so that the pattern is nigh-impossible for a security system to detect
If you do get blocked, your proxy will switch immediately and seamlessly. Proxy Pilot is free and easy to integrate with any system you’re using.
Secure
Good proxies add a layer of security by putting a middleman between you and your target server. They protect your anonymity and make sure you’re not vulnerable to an attack yourself. Such attacks are a common occurrence with free public proxies, which are often set up to steal your details or expose the data you’re scraping to others.
Proxies also add to the sheer volume of web scraping you can achieve, which is extremely important when you’re hoping to capture up-to-the-second, rich, voluminous data.
The bottom line, which should inform everything from your choice of scraping language to the proxies you buy: if you want to make good data-driven decisions, you need to do everything you can to get good data.
Web Scraping With Rust: To Recap
As described above, Rust is a relatively simple language. If it’s a bit more advanced than the tag “user-friendly” would suggest, it is at least supported by a passionate community of users who want you to catch the Rust fever and become a full-blown Rustacean.
There are a few ways to build a scraper in Rust, making use of “crates” (aka libraries) that help you use the power of CSS selectors to get the information you need from HTML and then parse it elegantly.
Don’t despair if you don’t feel up to the challenge of building your own web scraper. You can always give Scraping Robot a call, and they can handle your scraping needs efficiently and affordably. But if you do prefer the DIY method, consider Rust.
It may have its own language and mores, but once you master it, you may find, as many users do, that it’s less frustrating to use and leaves you feeling more secure and confident than earlier languages like C++.
Rust, Web Scraping, and Proxies
Whatever language you use, web scraping requires the best and most reliable proxies — proxies that are affordable, supported, safe, and not likely to get blocked or banned by servers. That means getting in touch with Rayobyte, market leaders who are passionate about proxies that are fast, secure, and dedicated to getting you the data you need to give your business the edge.
Regardless of what language you use to scrape — Python, C++, Golang, Rust, or any of the others — Rayobyte is there to support your web scraping journey, not least of all with its free, supremely helpful Proxy Pilot, the proprietary software that gives you data-driven support and precision, guiding you to the best proxies with its intelligence.
Get in touch with Rayobyte today and harness the power of proxies to support your Rust web scraping.
The information contained within this article, including information posted by official staff, guest-submitted material, message board postings, or other third-party material is presented solely for the purposes of education and furtherance of the knowledge of the reader. All trademarks used in this publication are hereby acknowledged as the property of their respective owners.