Introduction – Python Data Analysis Part 2
This is Part 2 of the data analysis with python Guide. For those interested in Part 1, click here.
I will briefly repeat the introduction that I wrote in Part 1 of Data analysis with python guide.
Python is a programming language and not a language that was created for data analysis. Therefore, in my opinion, data analysis using Python is a very not recommended way even when using the Pandas package. If you are still interested in analyzing data in the Python language or the job market requires you to learn it, I have prepared a guide that explains how to analyze data in the Python language.
Please note, the guide assumes that you have a basic knowledge of programming and understanding of data analysis in SQL.
In addition, it is important to know that the language of Python changes frequently. New functions come in all the time and sometimes even part of the syntax changes. There for, If you find a mistake you are welcome to contact us by email: [email protected]
The concept of Index Pandas
The use of indexes in pandas the DataFrames of pandas demonstrates why I think Data analysis with python is not good and why it is better to use SQL instead of Python. Most of the problems you will encounter in data analysis with python will probably be related in one way or another to the indexes and join’s with variables of Series data type that their index differs from your DataFrame’s index.
Therefore, in order to master data analysis in Python, there is no other way but understanding the concept of indexes.
Every DataFrame in Pandas has an index. An index is the row identifier. In most cases the index will be numeric, starting from 0 and increment by one for each line, but this is not always the case. For example, if we create a subset of a table using the Pandas loc function, the index of the rows in the new table will be the index of the rows in the original table.
Note: There can be DataFrames with more than one index.
Example:
We can see the DataFrame index if we run the command
data.index.tolist()
[0, 1, 2, 3, 4, 5, 6, 7, …….
Lets build a sub-table named b from the data table and run the same command on table b
b=data.loc[data[“Country”]==’France’,[“Country”,”CustomerID”]]
b.index.tolist ()
[31982, 31983, 50791, 50792, 50793, 50794, …
Explanation: The index of table b that was created from table ‘data’ is the same index of the original rows that appeared in the table of data.
Rest_index command
To reset the index of a DataFrame we can run the following command:
b.reset_index(drop=True,inplace=True)
Reminder: The parameter ‘inplace=True’ forces Pandas to change table b and not to change the output of the command.
Using the Copy function in Pandas’ commands
When creating a sub-table in pandas, pandas does not create a new table in the memory. Instead Pandas creates a logical table that points to the original table in memory and adds the changes we java made. This method allows you to save valuable running time and system memory, but this method can create problems when you want to make changes in the tables. For example, if we change the original table, the sub-table will also change.
To avoid these problems, you can create a new sub-table in Pandas that is not sharing the memory with another table. To do so you may use the “copy” function. This new table will not affect or be affected by the table from which it is derived.
For example:
b=data.loc[data[“Country”]==’France’,[“Country”,”CustomerID”]].copy()
Handling missing values in DataFrame
locating missing values
To identify missing values Pandas has a function called “isnull”:
data.loc[data[“CustomerID”].isnull()]
The above command will show us all the rows that have missing values in the CustomerID variable of the “data” DataFrame.
How to automatic fill missing values
In order to fill in missing values, Pandas has a function called fillna.
The function allows us to automatically fill in all missing values in a predetermined value:
data[‘Country’].fillna(‘No Country’,inplace=True)
In this example we will fill in all missing values with the ‘mode’ value:
data[‘Country’].fillna((data[‘Country’]).mode()[0], inplace=True)
Explanation of the command – (data[‘Country’]).mode()[0]
The ‘mode’ command returns a series var type and not a single value. Therefore we want to extract the mode value from the series. To do so we need to choose the first value of the series. This is why I wrote the [0] in the command.
Remove rows with missing values
Pandas has a function that is called ‘dropna’ which allows us to remove rows with missing values from the table.
If we run this function on the whole table Pandas will remove all the rows that has missing data:
data.dropna(inplace=True)
We can also remove all the rows that has missing values in particular variable with the help of “subset” parameter.
data.dropna(subset=[‘Country’],inplace=True)
Change the values of fields (recode)
In data processing sometimes it is necessary to change values in a table in a process called recode. Such cases can happen when a user types wrong values., or if we want to group quantitative values into categorical values.
If we were analysing the data with SQL we could perform these operations using the ‘case when’ commands. In the Python language there are several do it. In the following example I will show how to perform such operations using a function:
def age_groups_func(series):
if series < 20:
return “15-19 yrs”
elif 20 <= series < 25:
return “20-24 yrs”
elif 25 <= series:
return “25-30 yrs”
data[‘age_group’] = data [‘age’].apply(age_groups_func)
Explanation:
In the code above I built a function that divides the age into groups. Then, I activated the function using the ‘apply’ function on the ‘data’ DataFrame and I inserted the results to a new variable named Age_group.
Aggregation of data on Pandas using the groupby function
Data aggregation is very essential to the operation of data analysis.In Pandas the aggregation is performed by function groupby.
Handling missing values in groupby function
It is important to note that Unlike SQL language, by default aggregation in pandas ignores missing values of the aggregated fields. This means that if you want to check the aggregation values sum you may be missing some values because the NULL row might disappear.
To handle this problem and create aggregation like in SQL it is always important to write the aggregation with the “dropna” parameter is set to ‘False’
Create aggregation table in pandas with groupby function
Aggregation in pandas has two steps:
First step – declaration:
At this step we declare declared which fields we want to perform on the aggregation:
data.groupby([‘Country’],dropna=False)
the output of the declaration can be insert into a variable:
agg_var=data.groupby([‘Country’],dropna=False)
You can also use more then one field in the aggregate:
agg_var = data.groupby ([‘Country’, ‘age’], dropna = False)
Second step – creating metrics:
In this step we chose the metrics of the aggregation and performed the calculation with the use of the “agg” function. At the “agg” function we will use Paython’s ‘dictionary’ variable type to explain which fields and metrics we want to use.
Example:
agg_var.agg({‘Quantity’: [‘sum’,’mean’], ‘CustomerID’: [‘count’]})
In the above example we used the metrics of ‘sum’ and ‘mean’ on the Quantity field and a ‘count’ metric on the customers.
We can also write the two steps of aggregatio in one command without an intermediate variable:
data.groupby([‘Country’],dropna=False).agg({‘Quantity’: [‘sum’,’mean’], ‘CustomerID’: [‘count’] })
If you want to use only one metrics function on all the aggregated fields you don’t have to use the ‘agg’ function. In such case the function will run on all the fields of the table.
agg_var.mean()
or short form you can just use:
data.groupby([‘Country’],dropna=False).mean()
Join tables in Pandas with merge function
Join tables (DataFrame) in a pandas like we usually do in SQL can be performed by the function “merge”.
pd.merge(data, country_descrption, on=’Country’, how=’left’)
The first field is the main table (like the “form” table in SQL language) and the second table would be the table we want to join.
The on parameter is the key. The how parameter is the join type. The default how type is “inner” (i.e., connect the tables and return only the rows where there is a match between the keys in the two tables).
In the above example the key field name of the tables was similar (“Country”) but if this name is different you can use the following parameters:
left_on – the key fields of the left table.
right_on– the key fields of the right table.
if you want to connect tables with more than one key field you should write the variables in brackets:
pd.merge(data, city_descrption, on=[‘Country’,’City’], how=’left’)
Combine tables (Union) in Pandas using concat function:
Combine Tables like Union Clause in SQL are executed using the “concat” function as follows:
pd.concat([data, more_data]) #union
pd.concat([data, more_data]).drop_duplicates() #union all
In the above examples I attach the “data” and “more_data” tables. The first example is similar to the operation of “union all” in SQL while the second example is similar to “union” operation because I omitted the double rows.
Visualisation of data in the Python language using the seaborn package
The Python language allows us to display data and graphs. I think the easiest and most convenient way to view data is with Power BI but if you have already cerated the analyzes in Python, you may also want to view charts in Python.
There is a convenient library (as much as convenient can be when working with data in Python) called seaborn that helps us display charts. This package allows you to create a variety of charts. In the following examples I will show some simple charts that will allow you to understand how to work with the package.
Installation of Seaborn package
To work with the package you must run the command
import seaborn as sns
If the package does not exist in the work environment you can install it by running this command in your operating system .
pip install seaborn
You may need the help of a technical person to install the package.
Boxplot cart in Python – in the Seaborn package
For the Boxplot chart you may use this code:
sns.boxplot(
x=”Country”,
y=”age”,
data=data, # the data frame
order=data[“Country”].value_counts().iloc[:5].index #top 5
)
Explanation:
in parameters X and Y you can choose the variables that the Boxplot will work with. In the “data” parameter you can choose the DataFrame and in the “Order” parameter you can set the order by which the columns will be displayed. In the current graph I also limited the number of columns to 5. You can change this number to any number you want.
Histogram in Python using Seaborn package
For a histogram chart you may use this code:
sns.distplot(
data[“age”],
bins=20,
kde=False
)
Explanation:
In the above command create the histogram on the “age” variable with 20 bins. The “kde” parameter can help us to see the trend of the histogram. In this example I chose not to see the trend.
A few more words about Statsmodels package
The statsmodels package is a Python package that enables us to perform deeper statistical analyzes on the DataFrame. With this package you can use advanced statistics functions, regression analyzes, significance tests and more. As analysts you will usually not have to use this package, but sometimes if you will need to perform advanced statistical analyzes this package can be very useful.
A few words about scikit-learn (or sklearn) package
The scikit-learn package (also known as sklearn) is a package that specialises in creating machine learning models on the DataFrames. With this package you can build prediction models, run regressions of various types, perform cluster analysis, train neural networks and more.
This article was written by Yuval Marnin.
If you have a need to hire a freelancer data analyst you may contact me at: [email protected]
You may also hire me through upwork platform on that link:
https://www.upwork.com/freelancers/~018940225ce48244f0\
Further reading
The advantage of hiring a freelance data analyst.
What does a data analyst is doing and how it can help your company.