Normalizing data in SQL
In analytical work it is sometimes necessary to compare two variables that do not speak the same business language: for example, a customer's purchase amount versus their purchase frequency, or the size of a customer's loan versus their repayment ability.
Fields from such different content worlds cannot be compared directly, but data normalization methods let us convert them into the same mathematical language so that comparison becomes possible. Normalization methods are also used when building a holistic customer score (as in the RFM model).
There are two popular methods for normalizing data: dividing by the maximum value in the table, and standard scores (z-scores). In this post I will discuss both methods and show how to implement them in SQL using window functions.
Note: The examples in this post were written for a PostgreSQL engine.
Normalization by the maximum value.
This normalization method is easy to understand: find the highest value in the field you want to normalize, then divide every value in that field by that maximum. After the calculation all values fall between 0 and 1. (Note: there is a variant of the method that normalizes the field to values between -1 and 1.)
In the next query I take the customers table and normalize the total order amount per customer (the sum_orders field).
select
    customer_id,
    sum_orders,
    max(sum_orders) over () as max_value,
    sum_orders / max(sum_orders) over () as sum_orders_norm
from
    customers
Outcome:
Explanation
The max_value field contains the highest purchase amount across all customers. The sum_orders_norm field is the normalized field: the customer's order amount divided by max_value.
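To sanity-check the query logic outside the database, here is a minimal sketch that runs the same window-function query in SQLite (version 3.25+ supports window functions) on made-up data. The customer IDs and amounts are hypothetical, and the amounts are stored as REAL to avoid integer division.

```python
import sqlite3

# Toy customers table with hypothetical values.
conn = sqlite3.connect(":memory:")
conn.execute("create table customers (customer_id int, sum_orders real)")
conn.executemany("insert into customers values (?, ?)",
                 [(1, 100.0), (2, 250.0), (3, 500.0)])

# Same structure as the PostgreSQL query above.
rows = conn.execute("""
    select
        customer_id,
        sum_orders,
        max(sum_orders) over () as max_value,
        sum_orders / max(sum_orders) over () as sum_orders_norm
    from customers
""").fetchall()

for row in rows:
    print(row)  # e.g. (1, 100.0, 500.0, 0.2)
```

With the maximum at 500, each customer's normalized value is simply sum_orders / 500.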
The big disadvantage of this method is its handling of extreme values. Although the distribution itself does not change (the transformation is linear), the scale of the new field is driven by extreme values, which becomes a problem when we want to compare two normalized fields.
For example, if customers' median purchase amount is $200 but there is one extreme purchase of $10,000, every value has to be divided by $10,000. Comparing this normalized field with other normalized fields will then be difficult, because its values will be very low relative to fields without extreme values.
Normalizing data with the standard score method (z-scores).
The standard score (z-score) is a long-established statistical method for normalizing data. It converts the field you want to normalize into one whose mean is 0 and whose standard deviation is 1, by subtracting the mean from each value and dividing by the standard deviation. The advantage of the method is that it is far less sensitive to extreme values.
Let's try this method on the customers table from the previous section:
select
    customer_id,
    sum_orders,
    avg(sum_orders) over () as avg_sum_orders,
    stddev(sum_orders) over () as std_sum_orders,
    (sum_orders - avg(sum_orders) over ()) / stddev(sum_orders) over () as sum_orders_z_score
from
    customers
Outcome:
Explanation:
In the avg_sum_orders and std_sum_orders fields I calculated the average and standard deviation of all customers' total purchase amounts using window functions. In the sum_orders_z_score field I calculated the standard score itself.
The avg_sum_orders and std_sum_orders fields are not required for calculating the standard score, but they help show how the calculation is performed.
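The same calculation can be verified in plain Python on hypothetical order totals. Note that Python's statistics.stdev, like PostgreSQL's stddev aggregate, computes the sample standard deviation (dividing by n - 1):

```python
import statistics

# Hypothetical per-customer order totals.
sum_orders = [100.0, 250.0, 500.0, 150.0]

# Same formula as the SQL query: (value - mean) / standard deviation.
mean = statistics.mean(sum_orders)
std = statistics.stdev(sum_orders)
z_scores = [(x - mean) / std for x in sum_orders]
print(z_scores)
```

After the transformation the new field has mean 0 and standard deviation 1, which is what makes z-scored fields directly comparable to each other.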
To summarize
In this post I reviewed two popular normalization methods and demonstrated how to implement them in SQL. The standard score method is usually the preferred way to normalize data because it is less affected by extreme values. There are other normalization methods, suited to other cases, that I have not covered here.
For a business example that uses data normalization, see: Customer analysis – a holistic approach.
Practice SQL
For those interested in practicing SQL, the Strata Scratch website offers numerous exercises based on complex business questions, with an interface that lets you write and run queries and check whether your solution is correct. The site also lets you practice in Python. Practicing on the site is free, but you can sign up to get access to new exercises and to the solutions for all exercises (disclaimer: signing up through the link earns me a small commission).
This article was written by Yuval Marnin.
If you need to hire a freelance data analyst, you can contact me at: [email protected]
You can also hire me through the Upwork platform via this link:
https://www.upwork.com/freelancers/~018940225ce48244f0
Further reading
The advantage of hiring a freelance data analyst.
What a data analyst does and how one can help your company.