The Ultimate Guide to Normalize Data in Python
Normalizing data in Python can be done in several ways, with or without libraries such as sklearn. Python is a popular programming language for data science thanks to its ease of use and its helpful, active community. Its data science and machine learning libraries also make working with large amounts of data more manageable, easier, and quicker.
By normalizing data sets to a consistent scale, for example, 0 to 1, you simplify data analysis and help machine learning algorithms produce faster results. You’ll see various hypothetical examples of data sets being normalized later on.
In this article on how to normalize data in Python, you’ll learn about:
- What the Python programming language is
- The basics of programming mathematical formulas for normalization in Python
- The technical details of how to normalize data in Python using Python code and libraries
- The different ways to normalize data in Python
What Is The Python Programming Language?
Python is an object-oriented programming language. In recent years, it remains one of the most-used and in-demand programming languages, particularly for back-end software development, app development, and data science.
It is also used to automate tasks like data analysis, data collection, data visualization, and data normalization from large data sets. With the help of powerful Python libraries, it has become easier to streamline and automate these tasks with little to no error.
These libraries and frameworks also save developers time writing code that would otherwise have to be written repetitively. Plus, using Python for data science and data normalization has proven beneficial because numerous data science and machine learning libraries, like NumPy, pandas, and scikit-learn, are available to developers.
What Is The Benefit Of Using The Python Programming Language?
There are countless benefits to using Python, depending on the project you’re working on, as it is a versatile and easy language to learn and use.
One benefit is that it has a very active and large community around it where you can find answers to questions if you need help with a Python coding project. Users are continuously helping others with tutorials found online, contributing to ongoing projects. Plus, most Python libraries are free and available to the public for use in public or private Python programs.
Another benefit is that Python has high compatibility with other platforms. It is a cross-platform language, meaning you can use it on different operating systems and platforms without having to modify it. Additionally, it is not confined to a single ecosystem, unlike languages such as C#, which was originally designed for Microsoft’s .NET framework.
Several Python libraries specific to data science can be used for data collection (such as web scraping libraries), data manipulation, data visualization, and data interpretation. These include resources like NumPy, Pandas, and Matplotlib, as well as the machine learning library sklearn, also known as scikit-learn, which you’ll see in this article.
What Does Normalizing Data Mean?
Data science works to extract meaningful insights from data. This information could be in the form of any factual data that is recorded to make some kind of conclusion based on the data. For example, the data may help a business optimize performance, increase sales, cut costs, or make some kind of strategic decision.
Often when using data sets in machine learning, the data is not sorted or processed in its raw form. This makes it difficult to analyze the data for the insights you are looking for. Thus, to gain some form of understanding of the data, it needs to be processed even if the data set has differing units or scales.
For example, you may have a data set with two numerical features: age and income. The numbers in one column represent age and the numbers in the other column represent income.
Person | Age | Income |
Tom | 18 | 17,800 |
Richard | 56 | 32,000 |
Harry | 28 | 78,200 |
Mary | 41 | 64,000 |
In the example above, the numbers for the age column have a minimum of 18 and a maximum of 56, while the numbers in the income column have a minimum of 17,800 and a maximum of 78,200.
The numbers are not on the same scale: the ages are all two digits, while the incomes are all five digits. If you use this data set in machine learning as-is, the feature with the larger scale could dominate the model and bias its predictions.
The way that data science and machine learning fields can get useful information from data that have varying features, or variables, on a common scale is with data normalization. Data normalization is the rescaling of numeric attributes to a common range, typically the 0 to 1 range.
Normalizing the numbers makes it so that the values in the columns are scaled without there being a distortion in the difference of the values. You may want to do this because machine learning algorithms converge faster and perform better when the scales are lower. Plus, normalizing data in this way can produce better and faster results. This process of making features more suitable for machine learning is called feature scaling.
Data can be normalized in Python in several ways. (You’ll find the code for these methods in the “How to Normalize Data in Python” section.)
- The first technique is simple feature scaling, where each value is divided by the maximum value for that feature, or variable, making the new value’s range between 0 and 1.
- The second technique is called min-max, where the resulting figure is also in a range between 0 and 1.
- The third technique is called Z-score, or standard score. Its formula subtracts the mean (mu) of a variable from each value and divides the result by the standard deviation (sigma). The resulting figures usually hover around 0, typically in a range between -3 and +3.
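As a quick sketch, the three techniques can be written in plain Python without any libraries, using the ages from the earlier example:

```python
data = [18, 56, 28, 41]  # the ages from the example above

# 1. Simple feature scaling: divide each value by the maximum
simple_scaled = [x / max(data) for x in data]

# 2. Min-max scaling: rescale each value to the 0-to-1 range
min_value, max_value = min(data), max(data)
min_max_scaled = [(x - min_value) / (max_value - min_value) for x in data]

# 3. Z-score: subtract the mean, then divide by the standard
#    deviation (population standard deviation here)
mean = sum(data) / len(data)
std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5
z_scored = [(x - mean) / std for x in data]

print(simple_scaled)
print(min_max_scaled)
print(z_scored)
```

Note how the first two techniques keep every result between 0 and 1, while the z-scored values center around 0.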
After normalizing the data from the age and income example using min-max scaling, the numbers are on the same scale:
Person | Normalized Age | Normalized Income |
Tom | 0.0000 | 0.0000 |
Richard | 1.0000 | 0.2351 |
Harry | 0.2632 | 1.0000 |
Mary | 0.6053 | 0.7649 |
In the example above, the lowest age has been normalized to 0.0000 and the highest age to 1.0000, with two values in between. Likewise, the lowest income has been normalized to 0.0000 and the highest income to 1.0000, with the other two values in between. You can see that both sets of figures are now on the same scale: each ranges from 0 to 1.
The answers come from computing this formula:
Normalized Age = (x – min_age) / (max_age – min_age)
Normalized Income = (x – min_income) / (max_income – min_income)
In each equation, “x” represents the value you want to normalize, like 18 for the first age. For example, Harry’s normalized age is (28 – 18) / (56 – 18) = 10 / 38 ≈ 0.2632.
Next, we’ll look at another example of a data set you may want to normalize using a min-max formula. Let’s say you have a data set that contains interest rates with a range of 1.5% to 8%:
Interest rate name | Non-normalized interest rate |
Interest Rate A | 1.5 |
Interest Rate B | 2.3 |
Interest Rate C | 7.9 |
Interest Rate D | 8 |
That means your minimum feature is 1.5 and your maximum feature is 8. You may want to have a method of normalizing that data in a visual graph to have a range between 0 and 1. That can be accomplished with data normalization and using mathematical formulas, such as the min-max formula, to scale the features so that they range between 0 and 1.
To normalize this data set of arbitrary interest rates, you need to first find the minimum value and the maximum value. The smallest figure in the data set, or the minimum, is 1.5. The largest figure in the data set, or maximum, is 8. You would then use those numbers in a formula to find the normalized value for a given interest rate (x).
Normalized Rate = (x – min_rate) / (max_rate – min_rate)
In this formula, the minimum number is min_rate, and the maximum number is the max_rate.
Min_rate = 1.5
Max_rate = 8
Substituting these values gives the formula for each value of x:
Normalized Rate = (x – 1.5) / (8 – 1.5) = (x – 1.5) / 6.5
Using this formula for each interest rate, we get the following normalized results. After using the min-max formula, the figures in the data set, in this case, the interest rates, are all in a range from 0 to 1.
Interest rate name | Normalized interest rate |
Interest Rate A | 0.0000 |
Interest Rate B | 0.1231 |
Interest Rate C | 0.9846 |
Interest Rate D | 1.0000 |
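The table above can be reproduced with a few lines of plain Python (no libraries required):

```python
# Interest rates from the table above
rates = [1.5, 2.3, 7.9, 8.0]

# Find the minimum and maximum features
min_rate = min(rates)  # 1.5
max_rate = max(rates)  # 8.0

# Apply the min-max formula to every rate
normalized = [(x - min_rate) / (max_rate - min_rate) for x in rates]

for name, value in zip("ABCD", normalized):
    print(f"Interest Rate {name}: {value:.4f}")
```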
In these examples, there are only a few figures to normalize. They can be done manually, fairly easily, and quickly. However, you may have large data sets with thousands of figures, each with numerous features that need to be normalized.
Computer programs and computer languages can make calculations easier and faster. Software or a programming language such as Python can be used to normalize the data, or the data features, using one of several methods. Libraries such as sklearn can help you find the right code for your data normalization process as well as help you learn how to normalize data in Python.
If you are using Python or another programming language to assist you with these mathematical equations, you would need to either know the code to make the computer calculate the equations or you could use a library to help.
Why Should Data Be Normalized?
Data should be normalized for machine learning or data analysis for several reasons, depending on the machine learning algorithm used and the characteristics of the data. Normalized data is easy to visualize because the values are limited to a common range, for example, 0 to 1.
When doing calculations, normalized numbers are more stable and less likely to cause underflow or overflow issues. If using gradient-based optimization, normalized data makes for faster convergence.
What Data Should Be Normalized?
When doing data normalization, the numerical features or ordinal features of your data are going to be normalized. More specifically, the numerical features of your data that have different units or scales are the features that you should normalize.
Other features of your data, such as categorical features, would not need to be normalized unless they are encoded with numbers. Once encoded, those numerical values can be scaled if needed. Additionally, times and dates can be normalized if treated as continuous values.
Let’s learn the technical details of how to normalize data in Python in the next section.
How To Normalize Data In Python
You have data and understand that it needs to be normalized. Now it is time to learn how to normalize data in Python. But precisely how do you normalize data? This process may be easier using libraries, but it is possible to normalize data in Python without libraries if you prefer to write the code manually.
How To Normalize Data In Python Without Sklearn
The min-max formula presented in the earlier examples is one way to normalize data manually using Python. Once the output from the formula is provided, you then print or save it. The code from one example—specifically for normalizing the income data—can be written as such:
# Data array for the incomes of Tom, Richard, Harry, and Mary
data = [17800, 32000, 78200, 64000]
# Function to normalize the data using min-max scaling
def min_max_scaling(data):
    min_value = min(data)
    max_value = max(data)
    normalized_data = [(x - min_value) / (max_value - min_value) for x in data]
    return normalized_data
# Normalize the data
normalized_data = min_max_scaling(data)
# Print the original and normalized data
print("Pre-normalized Data:", data)
print("Normalized Data:", normalized_data)
If you run this code in a terminal in an integrated development environment (IDE) such as Visual Studio Code, the normalized incomes of the people will appear printed out in the console. You can do the same for the ages of the people in the example or for any other data sets.
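The same pattern works for the other techniques described earlier. For instance, a z-score version of the function might look like this (a sketch using the population standard deviation):

```python
# Incomes of the people from the example
data = [17800, 32000, 78200, 64000]

def z_score_scaling(values):
    # Mean of the data set
    mean = sum(values) / len(values)
    # Population standard deviation
    std = (sum((x - mean) ** 2 for x in values) / len(values)) ** 0.5
    # Subtract the mean and divide by the standard deviation
    return [(x - mean) / std for x in values]

normalized = z_score_scaling(data)
print("Z-scored Data:", normalized)
```

The resulting values center around 0, with values above the mean positive and values below it negative.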
How To Normalize Data In Python Sklearn
Knowing how to normalize data in Python with the sklearn library can make your job a lot easier. As explained earlier, the power of Python lies in its extensive libraries. These libraries contain already written code, so you don’t have to write lines of code for each calculation you need to normalize data. With these resources, you can leverage code that is already written to do the heavy lifting of the calculations for you. Plus, library code of this kind is widely used and well-tested, so it is far less error-prone than hand-rolled calculations.
When you use the sklearn library, you can import a function that takes care of the min-max calculation called MinMaxScaler. You can import the MinMaxScaler with this code:
from sklearn.preprocessing import MinMaxScaler
If you want to use sklearn to normalize the ages of the people, as an example, you could write lines of code that are significantly shorter due to importing the MinMaxScaler:
from sklearn.preprocessing import MinMaxScaler
# Original data for the people's ages
ages = [18, 56, 28, 41]
# Reshape this data to be a 2D array (which is required by the MinMaxScaler)
ages_2dArray = [[age] for age in ages]
# Create the MinMaxScaler object
scaler = MinMaxScaler()
# Fit and transform the data (perform the normalization)
normalized_ages = scaler.fit_transform(ages_2dArray)
# Extract the normalized values from the 2D array
normalized_ages = [value[0] for value in normalized_ages]
# Print the output of the normalized ages
print("Pre-normalized Ages:", ages)
print("Normalized Ages:", normalized_ages)
Rather than normalizing just one value for the example above, you can normalize data sets with more than one value, such as ages and incomes.
If you include the incomes in the example, you will get the following code:
from sklearn.preprocessing import MinMaxScaler
# Original data for the ages and incomes of the people in the example
ages = [18, 56, 28, 41]
incomes = [17800, 32000, 78200, 64000]
# Reshape the data to be a 2D array (which is required by the MinMaxScaler)
ages_2d = [[age] for age in ages]
incomes_2d = [[income] for income in incomes]
# Create the MinMaxScaler objects
age_scaler = MinMaxScaler()
income_scaler = MinMaxScaler()
# Normalize the ages and incomes after fitting and transforming the data
normalized_ages = age_scaler.fit_transform(ages_2d)
normalized_incomes = income_scaler.fit_transform(incomes_2d)
# Extract the normalized values from the arrays
normalized_ages = [value[0] for value in normalized_ages]
normalized_incomes = [value[0] for value in normalized_incomes]
# Print the original and normalized ages and incomes
print("Pre-normalized Ages:", ages)
print("Normalized Ages:", normalized_ages)
print("Pre-normalized Incomes:", incomes)
print("Normalized Incomes:", normalized_incomes)
If you run this code, you will receive outputs similar to those you would get by running it just for age or just for income.
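As a variation, both features can also be passed to a single MinMaxScaler as columns of one 2D array, since the scaler normalizes each column independently (a sketch of the same normalization):

```python
from sklearn.preprocessing import MinMaxScaler

ages = [18, 56, 28, 41]
incomes = [17800, 32000, 78200, 64000]

# Each row is one person; each column is one feature.
# MinMaxScaler rescales every column independently.
data = [[age, income] for age, income in zip(ages, incomes)]

scaler = MinMaxScaler()
normalized = scaler.fit_transform(data)

print(normalized)
```

This keeps related features together in one array, which is the shape most sklearn estimators expect anyway.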
How To Normalize Text Data In Python
When learning how to normalize data in Python, you may also have to normalize text data. Normalizing text data standardizes it and makes it easier to analyze. Text can be normalized with libraries such as spaCy and NLTK (Natural Language Toolkit).
These libraries perform functions such as making all the text lowercase, removing punctuation, removing numbers, tokenizing each word, and removing common words such as “the,” known as stopwords.
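To illustrate the idea without those libraries, here is a minimal sketch of the same steps in plain Python, using a small hand-picked stopword list for illustration (real projects would use the stopword lists shipped with NLTK or spaCy):

```python
import string

# A tiny, hypothetical stopword list for illustration only
STOPWORDS = {"the", "a", "an", "and", "is", "of", "to"}

def normalize_text(text):
    # Make all the text lowercase
    text = text.lower()
    # Remove punctuation and numbers
    text = "".join(
        ch for ch in text
        if ch not in string.punctuation and not ch.isdigit()
    )
    # Tokenize on whitespace and drop common stopwords
    return [token for token in text.split() if token not in STOPWORDS]

print(normalize_text("The 3 quick foxes jumped over the lazy dog!"))
```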
How To Test If A Data Set Is Normally Distributed
There are several methods of testing the normality of a data set. The data set can be visually inspected by using a probability plot of the data. If the data follows a bell curve on a histogram or a straight line on a quantile-quantile plot (Q-Q plot), then the data is likely approximately normally distributed.
Some lines of code may also be used to test for normality. The Kolmogorov-Smirnov test uses the scipy.stats.kstest() function, while the Anderson-Darling test uses the scipy.stats.anderson() function to test for normality.
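A minimal sketch of both tests, assuming NumPy and SciPy are installed and using randomly generated data purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical data: 500 draws from a normal distribution
rng = np.random.default_rng(42)
sample = rng.normal(loc=50, scale=5, size=500)

# Kolmogorov-Smirnov test: standardize the sample first so the
# comparison against the standard normal distribution is valid
standardized = (sample - sample.mean()) / sample.std()
ks_stat, ks_p_value = stats.kstest(standardized, "norm")

# Anderson-Darling test: returns a statistic plus critical values
ad_result = stats.anderson(sample, dist="norm")

print("KS p-value:", ks_p_value)
print("AD statistic:", ad_result.statistic)
```

For the KS test, a large p-value (commonly above 0.05) means there is no evidence against normality; for Anderson-Darling, the statistic is compared against the returned critical values.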
How To Transform Data To Normal Distribution In Python
Several techniques exist for transforming data into a more normal distribution. Taking the square root of the data can reduce extreme values, and applying the logarithm can make the data more symmetric and stabilize its variance. Functions such as scipy.stats.boxcox() can find an optimized transformation toward a normal distribution.
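A small sketch of the square-root and log transforms on a hypothetical right-skewed data set, using only the standard library:

```python
import math

# A hypothetical right-skewed data set (a long tail of large values)
skewed = [1, 2, 2, 3, 4, 8, 20, 100]

# Square-root transform: pulls in extreme values
sqrt_transformed = [math.sqrt(x) for x in skewed]

# Log transform: compresses the scale even more aggressively
# (only valid for positive values)
log_transformed = [math.log(x) for x in skewed]

print(sqrt_transformed)
print(log_transformed)
```

Note how the largest value, 100, shrinks to 10 under the square root and to roughly 4.6 under the natural log, bringing it far closer to the rest of the data.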
Normalizing Data Is Easy With Python
Now you know what data normalization is, why you might need to normalize your data set, and how to normalize data in Python. Plus, Python makes it easy to organize, visualize, and prepare your data for machine learning.
Yet, before training machine learning models, it is common practice to normalize your data first. If you have data sets that are on different scales, the features of the data can be normalized so they can be analyzed on an equal footing. This will give you faster and better results and makes the model less sensitive to scale, resulting in better coefficients after training.
Data can be normalized in several ways, with or without a library such as sklearn. And if you don’t know where to start, these libraries can help you find the codes and calculations needed to seamlessly normalize large and small data sets.
Get More Help From Rayobyte
Thanks to its libraries, Python is a great language for working with all types of data. And this data can be normalized using mathematical formulas, such as the min-max scaling method. If you have small amounts of data, you can normalize the data in Python by writing a program that performs the function with the given data. Plus, there are various resources available online to help you find the right code to normalize large amounts of data that call for repetitive lines of code.
If you have data that you’ve collected using a web scraping tool and need help sorting, analyzing, or normalizing it with Python, partner with an authoritative source of web scraping knowledge and proxy products such as Rayobyte.
Rayobyte can give you more information on using Python for activities such as web scraping. We can also help you learn how to use proxies to hide your IP when web scraping.
Get in touch with us to learn more about how to normalize web scraped data in Python using the best residential proxy product on the market today.
The information contained within this article, including information posted by official staff, guest-submitted material, message board postings, or other third-party material is presented solely for the purposes of education and furtherance of the knowledge of the reader. All trademarks used in this publication are hereby acknowledged as the property of their respective owners.