An In-Depth Look Into Parsing JSON With Python
Web scraping is a valuable tool for enterprising businesses because it enables the collection of information from essentially anywhere on the internet. Generally, this is done with a scraping program routed through some kind of proxy, so its requests look like they come from a typical web browser. It interacts with websites in a manner very similar to how real people interact with the sites they visit. When configured properly, such a setup closely mimics human behavior while eliminating human errors.
There are many reasons why web scraping has become so popular lately. Its popularity can largely be attributed to the fact that most companies have some sort of website that they need to maintain or update regularly. More websites mean more data to collect and a greater need for highly efficient software tools.
Tech managers looking to add tools to their arsenal ought to learn what it means to parse JSON with Python and the benefits it provides. Additionally, if you’re looking to purchase proxies for web scraping, you should learn more about parsing JSON data with Python first.
What does JSON stand for?
JSON stands for JavaScript Object Notation. It launched in the early 2000s, just as people were beginning to fully realize the power of the internet. Computer programmer Douglas Crockford and software architect Chip Morningstar sent the first JSON message in April of 2001. Crockford bought the JSON.org domain in 2002 and published the JSON grammar and other useful information there so that more users could easily implement JSON in their projects. This, combined with how easy JSON is to learn and parse, helped it quickly gain popularity.
JSON is a lightweight data interchange format derived from JavaScript object notation syntax. It’s a text-based format used for representing data structures such as lists, maps, dictionaries, and arrays. You can write the code for reading and generating JSON data in any programming language. JSON is famous for being very easy to learn and intuitive to use.
Overall, JSON has a lot of similarities to XML, so if you’re familiar with one, you should be able to easily learn the other. However, JSON is generally easier for humans to read than XML because it carries far less markup. Over the last several years, JSON has dramatically overtaken XML in terms of the sheer number of users. Plus, JSON has a few other advantages compared to XML.
Advantages of using JSON
- It saves time because JSON maps directly onto native data structures, so there’s far less parsing overhead than with XML.
- You don’t need a special parser program to work with JSON; most languages support it out of the box.
- It’s typically faster to parse and transmit than XML.
- JSON files are small compared to XML files.
For example, here is an excerpt from the Wikipedia JSON file:
{
  "title": "Wikipedia",
  "pageid": 935,
  "ns": "en.wikipedia.org",
  "revisions": [
    {"text": "This page was last modified on Tuesday, 19th January 2017 at 10:17PM GMT by Boudewijn Fieker (talk | contribs).", "timestamp": 1326453619, "userid": "Boudewijn Fieker"},
    {"text": "This article is about the JavaScript Object Notation language.", "timestamp": 1326457186, "userid": "Boudewijn Fieker"}
  ],
  "image": {
    "url": "https://upload.wikimedia.org/wikipedia/commons/c/c6/Wikipedia_logo.svg",
    "width": 100,
    "height": 100
  },
  "file": "Wikipedia.png",
  "size": 72381,
  "md5sum": "1f9f8b1e054d58c4ef8a2fe7f0a7bbb3",
  "mime": "image/png",
  "comment": ""
}
Disadvantages of using JSON
In general, JSON is great for boosting productivity in small businesses. That said, it’s not ideal for everyone. You should consider JSON’s disadvantages before deciding to incorporate JSON parsing into your business model. For example:
- When parsing JSON, your own code has to determine what kind of information the JSON object contains; the format itself doesn’t enforce a schema.
- Unlike JavaScript, JSON files cannot contain variables, comments, or executable code.
- JSON data must be explicitly encoded and decoded by your program.
- There are no built-in functions for checking the validity of JSON data; in practice, you validate it by attempting to parse it (see the sketch after this list).
- Large JSON files are hard to read at a glance without a formatting tool such as your browser’s developer tools or a dedicated JSON viewer.
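To make the last two points concrete, here is a minimal Python sketch using the standard json module. It assumes the JSON text is already in a string; the field names are just illustrations:
import json

raw = '{"name": "Rayobyte", "active": true}'

# Decoding must be done explicitly; nothing validates the text until you parse it.
try:
    settings = json.loads(raw)           # decode JSON text into a Python dict
except json.JSONDecodeError as err:      # the only "validity check" is a failed parse
    print(f"Invalid JSON: {err}")
else:
    print(settings["name"])              # -> Rayobyte
    print(json.dumps(settings))          # encode the dict back into JSON text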
What is Python 3?
The original Python was an open-source programming language created by Guido van Rossum. It’s been around for quite some time now, and numerous versions of Python exist. The first release appeared in 1991, and version 2.0 followed in 2000 and quickly skyrocketed to success. Many developers still think of Python 2 first, but that branch reached its end of life in 2020, which makes Python 3 the version to focus on today.
Python 3 was first released in 2008, and it’s the current major version of Python. It’s broadly similar to Python 2, with a few notable differences. For example, while Python 2 distinguishes between fixed-size integers and longs, Python 3 has a single integer type with arbitrary precision. In addition, Python 3 treats all strings as Unicode by default, turns print into a function, and has dictionary methods that return views instead of lists.
As with any programming language, Python 3 needs a good text editor to work with. Sublime Text is highly recommended by tech experts, but there are plenty of other options out there. If you’re using Windows, check out PyCharm or Visual Studio Code. Both are very compatible with Windows systems.
How does web scraping work?
At its core, web scraping is a specific form of data scraping, which is simply the act of taking content from one site and placing it onto another. It’s often done for gathering information. You might utilize web scraping just because you like the content on one of your sites and want to transfer it to a different site. Alternatively, you might want to add new features to your website. As a useful example, we’ll stick with the last point — adding new features to your website.
The problem with most sites today is that no matter how much work goes into their creation, there’s always something missing. Someone could create an awesome-looking website but never add any custom user accounts to it. They also might not allow people to post comments or share articles from the website on social media platforms. So, what do website designers do if they want to add these features?
The best thing to do would be to scrape the existing website and replace it with a better version of itself. You could then use the exact same HTML structure and code that you used before and simply add all the extra features you wanted into the new version of the site.
These kinds of scenarios are when scraping comes in handy. Scraping allows you to take content from one place and place it in another. Its versatility extends beyond websites, too — you can also use it to grab text files from FTP servers, pull in RSS feeds, and more. Virtually any content that you can get from a website you can scrape out and put somewhere else.
Put simply, JSON is an easy and efficient way of storing information in a string format. Think of it as a list of key/value pairs that looks like this:
"my_key": "my_value", "another_key": "another_value"
A set of these key/value pairs wrapped in curly braces is known as an object. When a scraper (a program or script that extracts content from webpages) pulls structured data off a site, it is often these objects that it collects.
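As a quick illustration, here is how such a string can be turned into a Python dictionary with the standard json module; the surrounding braces are added because json.loads expects a complete object:
import json

text = '{"my_key": "my_value", "another_key": "another_value"}'
data = json.loads(text)   # parse the JSON text into a dict
print(data["my_key"])     # -> my_value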
It’s also good to note that a data center proxy gives you extra security and anonymity in your scrapes. Data center proxies are not affiliated with any Internet Service Provider (ISP) and utilize completely private IP authentication. Rayobyte’s data center proxies can perform billions of web scrapes every month from 26 different countries. When combined with JSON, a good data center proxy is an invaluable web scraping asset.
Why is JSON useful?
The best way to understand how JSON works is to contrast it with other common ways of representing data. For instance, if you’re trying to store contact info in a database, you might use a simple text file. Or you could utilize a more structured approach by using an XML document. If you wanted to send some data from one program to another — say, sharing it via email — then you might use something like CSV (Comma Separated Values).
But what about sending data directly over the internet? What would you do? You could use HTTP (Hypertext Transfer Protocol) to transfer files across the web, but this would be slow and cumbersome since each file must be downloaded individually. Alternatively, you could encode your data as a string of characters and send those strings over the internet. This is how things have been done for many years, but it has its drawbacks.
For starters, if you want to send data from one computer to another, you need to know both computers’ IP addresses. And even if you don’t care about the source address, you still need to worry about how long the data will take to reach its destination. The length of time depends on factors such as network congestion and the speed of the recipient computer’s connection.
Imagine what happens when you try to send a large text file over the internet. Large text files don’t just take longer to download; because the data travels as many separate chunks, there are also more opportunities for something to go wrong along the way. Imagine packets of data being sent back and forth between two computers until all of the data has been successfully transferred: the more chunks involved, the more places an error can creep into the process.
So, how can we fix these problems? One option is to use binary encoding, which sends data in blocks rather than individual characters. Another option is to use a standard character encoding called ASCII (American Standard Code for Information Interchange), which assigns numbers to letters so that text can be easily read and written by both humans and machines.
But there’s another option that’s rarely discussed despite the many advantages it offers: JSON.
JSON, in essence, is a way of representing data using a compact set of rules. It’s especially useful for transmitting data across the internet because it can be easily read by machines (like browsers) without requiring any human intervention. JSON can help you save space and reduce bandwidth costs while simultaneously making it easier to securely access your data at a future date.
JSON for storing data
The most important thing to remember about JSON is that it’s not a programming language itself. Instead, it’s a data serialization format. That means that it’s used to convert data into a machine-readable format. Once you’ve converted the data into a JSON string, you can use it anywhere you need to.
As was mentioned, JSON is commonly used for sending data across the internet. Most websites now offer APIs that allow developers to build their own applications. These APIs are useful, but they often require that developers submit data in a specific format. So instead of having to write code just to handle the transmission of data, developers simply use a tool like the JSON module in Python to convert data into JSON before submitting it.
When someone wants to pull down the data they just created, they’ll receive a JSON response. They can then decode the response to retrieve the original data.
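Here is a rough sketch of that round trip using Python’s json module. The payload fields and the send_to_api call are placeholders for whatever API you’re actually working with:
import json

payload = {"user": "photo1", "action": "upload"}

# Encode the Python dict as a JSON string before submitting it to the API.
body = json.dumps(payload)
# send_to_api(body)  # placeholder for your actual HTTP call

# When the API answers, decode its JSON response back into Python objects.
response_text = '{"status": "ok", "id": 42}'
response = json.loads(response_text)
print(response["status"])   # -> ok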
How does JSON compare to XML and CSV?
Because JSON uses a dictionary-based syntax, it is often compared to XML. Both define a structure that expresses relationships among multiple pieces of data. In some ways, XML is more flexible than JSON; for example, XML elements can carry attributes and mixed content in addition to nested elements, which makes it well suited to complex documents.
However, JSON is less verbose than XML. Because JSON doesn’t wrap every value in opening and closing tags, it’s much shorter and simpler to read. Additionally, because JSON maps directly onto the data structures that web applications already use, it’s often better suited for transferring data online than CSV or XML.
Let’s look at an example. Imagine you’re building a website that lets users upload photos. To keep track of who uploaded what photo, you’d probably use a relational database system. However, you don’t want to spend a ton of money on hosting a database server. Instead, you decide to use a NoSQL solution that stores data in key-value pairs.
Your next step would likely be to define a schema that includes the following fields: id, name, description, location, and status. Each field contains a unique identifier, and the values stored in each field depend on the type of data being stored.
To make sure that the data is properly formatted, you decide to use JSON. Here’s a sample representation of the data you’d expect:
{"id": 1, "name": "photo1", "description": "My first picture.", "location": "home/me", "status": "active"}
This JSON object represents a record with ID 1 whose name is photo1, and the value of the status field indicates that the record is active. Although this example is flat, JSON objects can also be nested: any property, such as location, could hold another object or a list instead of a simple string.
If you need to serialize this record, for example to send it to the page that displays the photo1 user’s profile, you could use the following code:
import json
user = {"id": 1, "name": "photo1", "description": "My first picture.", "location": "home/me", "status": "active"}
print(json.dumps(user))
This code prints the dictionary serialized as a JSON string, which looks like this:
{"id": 1, "name": "photo1", "description": "My first picture.", "location": "home/me", "status": "active"}
Notice that all of the properties are enclosed within a single pair of curly brackets, which marks them as members of one JSON object. The result is that the JSON string can be parsed back into a Python dictionary.
Once you’ve retrieved the JSON string, you can decode it using the JSON module. This returns a Python dictionary that represents the original data.
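Continuing the example above, decoding is just the reverse call (a minimal sketch that reuses the user dictionary and the json import from the earlier snippet):
json_string = json.dumps(user)      # encode: dict -> JSON string
restored = json.loads(json_string)  # decode: JSON string -> dict
print(restored == user)             # -> True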
With JSON, you can avoid spending time training yourself and your employees in a new programming language. Instead, you can focus your efforts on building cool apps and implementing innovative new technologies. This ease of use makes JSON a great choice for anyone looking to simplify their development workflow. Plus, it minimizes time lost to hiring new employees or retraining old ones.
Why web scraping helps extract information from dynamic websites
If you still feel like you’re not quite sure what web scraping is, here’s a quick rundown. Basically, you want to pull the important information out of a website. You start by loading the site, either by typing the URL into your browser or by clicking a link. Once the website loads up and displays its content, you use a script to search through that content looking for specific information.
Using a browser to access the internet
When you visit a website with your browser, the browser requests a file called a webpage. The webpage is usually stored on an internal server located within the company hosting that website. The browser requests the file by sending an HTTP request to the server, which then responds with the requested webpage.
The problem with this method, however, is that the server won’t always respond correctly. Sometimes when a webpage is being updated, it takes a while before it actually appears online. Or sometimes the content changes after it has been published. This results in those annoying “this webpage is temporarily down” messages.
Web scraping using a proxy server
To work around the problem of malfunctioning servers, you can use a proxy server that sits between your browser and the website rather than letting the site respond directly to the browser request. The proxy server sends a new request on your behalf and can bypass some of the issues associated with having to wait for the site to load up.
Residential proxy servers help prevent many web scraping issues. Unlike data center proxies, residential proxies use the IP addresses associated with real devices. These IP addresses are supplied by ISPs. Since residential proxies use real, authenticated IP addresses, they don’t trigger anti-scraping technology as easily as data center proxies sometimes do.
Some unethical residential proxy servers do exist. In some cases, people unknowingly consent to terms of service that allow their internet devices to be used as proxies without their awareness. You should protect yourself from legal fallout by only purchasing residential proxies from reputable companies. Businesses like Rayobyte ensure that all their residential proxy servers are ethically sourced and that the device owners are appropriately compensated.
Types of Python web request libraries
The major benefit of web scraping projects is that once you have scraped the data, you can use it for your own purposes. You can analyze the data, make predictions based on the data, or do anything else you want with the new information you have gathered.
Requests library
Requests is the de facto standard HTTP library for Python, although it is a third-party package rather than part of the standard library (installation is covered later in this article). It is very easy to use if you just want to make a simple request.
import requests
response = requests.get('http://example.com')
print(response.text)
http.server (SimpleHTTPServer)
This is Python’s built-in module for running a simple HTTP server on your local machine. You can then access the server by typing “localhost” plus the port number it listens on (e.g., http://localhost:8080) into the browser address bar.
Once the server is running, any page you request from it is returned just like a response from a remote website, so it’s a handy target for testing your scraping code.
You can also request a specific page directly by adding its filename (for example, index.html) to the end of the URL.
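Here is a minimal sketch of starting such a server, assuming Python 3’s built-in http.server module is what you want (you can also simply run python3 -m http.server 8080 from a terminal):
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Serve the files in the current directory at http://localhost:8080
server = HTTPServer(("localhost", 8080), SimpleHTTPRequestHandler)
server.serve_forever()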
requests-futures
If you’d like to use a future-based library rather than blocking, try using this one instead. It’s more advanced than the previous two but still relatively easy to use. Once installed, simply import the library and run your first request.
from requests_futures.sessions import FuturesSession

session = FuturesSession()
future = session.get('https://www.python.org/')  # returns immediately with a Future
response = future.result()                       # block only when you need the answer
print(response.content)
How to make an HTTP request to httpbin.org
In this quick tutorial, we’ll go over how to make an HTTP request to httpbin.org with Python. We’ll also cover how to parse the response data into Python code. In order to access this site, you’ll first need to install the requests library for Python. Once you have it installed, let’s get started!
1. Install the requests library for Python
- Open up a terminal window and type "sudo apt-get install python3-requests" (on Debian or Ubuntu systems)
- Or, on any system with pip, type "python -m pip install --user requests"
2. Import the requests library
First, it’s important to point out that the requests library is not included as part of Python by default. You must use the pip tool to download the library. To do this, follow these steps:
- Step 1: Open up your Terminal, and then enter the command below: $ sudo apt-get install python3-requests
- Step 2: Type "pip install --user requests"
- Step 3: You can confirm the installation by running "python -m pip show requests"; the output should include the library’s name and the installed version.
The above information serves as a great reference point, but there are other ways to add the requests library to your system. One method is to download it directly from GitHub. (If you don’t know what GitHub is, it’s basically a place where developers share their code for others to use.)
To install the requests library directly from GitHub, open up your terminal and type git clone https://github.com/psf/requests.git
This will take a few moments to complete depending on your internet speed.
Once the cloning process is done, navigate into the new folder that was created and then type python -m pip install .
3. Create a new Python script
Now that the requests library is installed, it’s time to create a simple Python script that makes an HTTP request to the URL we want to fetch data from.
First off, open up Python IDLE (or another IDE) and create a new file; we’ll call it request_example.py. The name is arbitrary, but avoid names of installed packages, such as pytest.
Then paste the following code into that file:
import requests
url = 'http://httpbin.org/get'
response = requests.get(url)
print(response.text)
Here, you’re telling Python to import the requests library, which is the best way to handle HTTP requests in Python. The next line stores the URL you want to send the HTTP request to in a variable, requests.get() sends the request, and print() writes the body of the response to the terminal.
Finally, once you have entered the code, save your file; you’ll run it in the next step.
4. Make an HTTP request to your desired website
It’s crucial to note that any website you wish to download data from will require a different URL than the one used here.
When you have found the correct URL, copy it into the url variable of the above code, then run the script (for example, python request_example.py) to execute the request.
After a moment, the output of your HTTP request should appear in your terminal. If there were no errors, the results should look like this:
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.28.1"
  },
  "url": "http://httpbin.org/get"
}
That’s all there is to it. You can now view the contents of the URL you sent the request to.
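Because httpbin.org returns JSON, you can also ask requests to parse the body for you instead of treating it as plain text. A small sketch of that:
import requests

response = requests.get('http://httpbin.org/get')
data = response.json()            # parse the JSON body into a Python dict
print(data["headers"]["Host"])    # -> httpbin.org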
Working with JSON objects
JSON is a very compact format. Instead of having to create multiple lines of code to define each object, the syntax makes it possible to simply define the structure of an object and then assign values to those properties. That means that JSON is incredibly easy to read and write.
There are a few different ways to work with JSON. The first and most common approach is to use the built-in json module. This is especially useful when you’re dealing with large amounts of data: because the data arrives as a string (or a file), you can simply pass it directly to the module and let it handle the parsing. Once you do that, you can use the methods defined within the module to manipulate the data.
If you were to open up the json module documentation, you would see a short list of functions that cover the basic operations of reading and writing data, chiefly load, loads, dump, and dumps. You won’t find much beyond that, because there isn’t really anything special about parsing JSON data; a single call does the job.
The json module gives you access to the data inside of the JSON file. It’s important to note that the file is only read during the parsing phase: after that, you’re working with an ordinary Python object in memory, and changing that object does not change the file unless you explicitly write it back. To illustrate this point, let’s take a look at a simple program that loads a JSON file and prints out part of its contents.
import json
f = open("test.json", 'r')
data = json.load(f)
print(data["name"])
f.close()
The output should look something like this:
John Smith
Now let’s change the program slightly so that it writes new data to the file instead.
import json
f = open("test.json", 'w')
json.dump({'name': 'Jane Doe', 'age': 32}, f)
f.close()
Here we changed the mode of the file to write, so json.dump overwrites the file’s contents. Now when we rerun the original reading program, we get a completely different result.
Jane Doe
That’s not because the loaded data is immutable: the dictionary returned by json.load is an ordinary, modifiable Python object, but it’s only an in-memory copy. Nothing changes on disk until you write the data back out with json.dump, as we did here.
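Putting the two halves together, a typical read-modify-write cycle looks something like this minimal sketch:
import json

with open("test.json") as f:
    data = json.load(f)        # in-memory copy of the file's contents

data["age"] = 33               # modify the copy freely

with open("test.json", "w") as f:
    json.dump(data, f)         # persist the change back to disk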
Loading JSON data
At this point, you should understand how the json.load function works. For our next step, we’ll show you how to work with JSON data using the standard library. This is the same technique that you’ll likely see if you’re building a web application: when someone sends you a request, they typically send you JSON data, and the server parses that data and passes it along to your application.
Let’s say that you have a list of names that you want to display. Here’s a small snippet of code that illustrates how you might go about doing that.
import json
names = ["John Smith", "Jane Doe"]
for name in names:
    print(name)
When you run this, you should see the following output:
John Smith
Jane Doe
As you can see, you can loop over the items in the list and print them out individually. But what happens if you have a lot more data than we did here? How would you handle thousands of records?
Let’s say that you wanted to build an app that displays a list of all of the employees in a company. For this example, let’s assume that you have access to the employee database. If that’s the case, then you could easily construct a query that returns all of the employees.
To convert this to JSON, you could simply call json.dumps on the result set.
import json

emp_list = {"emps": []}                  # will collect the employee IDs
with open("employees.json") as f:
    emp_json = json.load(f)              # assumed to be a list of employee objects
Then you can simply add each employee to the list.
for emp in emp_json:
    emp_list["emps"].append(emp["id"])
After running this, you should end up with a list containing all of the employee IDs.
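If you then want to turn that collection back into JSON, as mentioned above, json.dumps does the job (continuing the same sketch):
print(json.dumps(emp_list))   # e.g. {"emps": [1, 2, 3]}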
Parsing the HTML from an HTTP request
Most websites use HTML markup language to display information on their pages. The majority of these websites follow a standard when it comes to displaying text which makes them easy to parse and understand. To get the text from the page, we’ll first need to find out what kind of website we’re dealing with.
We can do this by looking at the page’s HTML source. If we view the source of a simple website in our browser, we’ll see something like this:
<html> <head> <title>My Website</title> </head> <body> Welcome to my homepage! </body> </html>
In this example, the website uses the <title> tag to show us the title of the site. From markup this plain, we can tell that this is a static website that doesn’t change often.
If the website changes frequently or uses non-standard markup language, then parsing the HTML code won’t work too well because the website could change its appearance without changing the code itself.
Luckily, most modern websites use the standardized HTML5 markup language, which makes it easier to extract information using Python and Beautiful Soup. We’ll start off with a simple program that extracts the title from a webpage.
Beautiful Soup is a library for making sense of messy HTML code. It’s free, open-source software under the MIT License that makes it easy to parse, navigate, and modify HTML documents.
To install the library, run the following command:
pip install beautifulsoup4
Once the package is installed, let’s start writing our program.
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://www.webpagescrapper.com/website_scraper_python.html'
response = urlopen(url)
soup = BeautifulSoup(response, 'html.parser')  # a soup object containing the parsed HTML of the page
title = soup.find("title").text
print(title)
This script prints out the title of the website.
To make the program more effective, let’s add a few lines to make sure we only grab the title and not the rest of the website’s content.
title_tag = soup.find("title")
if title_tag is None:
    print('No data found!')
else:
    print(title_tag.text)
The output will be the page’s title. For a site like Reddit, for example, it would look like this:
Reddit – Dive into anything
Parsing JSON with json.loads() in Python
A lot of JSON files are built from keys and values: every key is a string that names a property, while the value can be a string, a number, a Boolean, a list, or another object. In order to work with those properties as variables in Python, we’ll load them into dictionaries or lists. Lists tend to be easier to read and understand for beginners, and they also make it easier to manipulate those values later if you want to do something like print them out.
One issue with parsing JSON data in Python is that it’s not always clear how to interpret what each line of code does. For example, here’s a simple snippet of code that uses json.loads():
import json

with open('data_file.json') as f:
    my_list = json.load(f)   # json.load reads from a file object; json.loads expects a string

x = []
for i in my_list:
    x.append(int(i))
The purpose of this particular piece of code is to load the contents of a JSON file into a list called my_list. So, let’s examine what happens when we run this code.
In this example, the json.load() function parses the information found inside the file “data_file.json” and stores it in the variable my_list (json.loads() does the same job when the JSON is already in a string). Once the file has been loaded, each item is converted from a string to an integer by calling the int() function, and the results are collected in the list x.
This isn’t the only way to read JSON data in Python, but it’s one of the most common. Another option is to lean on a library that handles part of the work for you; the requests library, for example, can decode a JSON HTTP response directly through its .json() method. However, since we’re just getting started, let’s stick with the simpler method of doing it manually.
Effectively utilizing web scraping in your business
Web scraping is a very useful skill. The growing number of dynamic websites makes it a necessity for any organization that deals with high amounts of data. Whether you want to transfer information securely between your business partners or relaunch a website that badly needs an upgrade, web scraping can help you achieve success.
You may be tempted to pour heaps of money and resources into overhauling your entire company or hiring new, expensive specialists to build an effective web scraping team. This is feasible for large corporations to do, but not for small businesses with limited resources and a low risk tolerance.
Luckily, a total system overhaul isn’t necessary. You shouldn’t try to learn how to do everything at once. By learning one small thing at a time, you’ll have a much better understanding of what you’re doing. You’re also more likely to actually complete your projects instead of getting bogged down in frustration.
For example, if you want to start incorporating parsing data with JSON in Python into your business functions but have never used JSON before, you should tackle the building blocks of JSON web scraping one at a time. Learn about how Python 3 compares to other versions of Python, learn about parsing HTML, and pick a proxy that works well with your business’s objectives.
If you can’t decide between data center and residential proxies, ISP proxies — notably those offered by Rayobyte — are an excellent compromise. ISP proxies are hosted in data center servers, but their IP addresses don’t belong to the data centers. Instead, they use IP addresses associated with an ISP. This gives them the speed of data center proxies as well as the security of residential proxies.
If you want to see more helpful explanations of what’s happening on the cutting edge of the tech industry, we also cover topics like automatic data collection, Python web scraping, and more on the Rayobyte blog. With hard work and a zeal for innovation, you can maximize your efficiency and turn that efficiency into profits.