Data analysis with Python part 1


Introduction – Data analysis with Python

This post is the first part of the Python Data Analysis Guide.

I will start the guide with a reservation: I do not like the idea of data analysis with Python. Python is a programming language, it was not developed for data analysis, and in my opinion this is the root cause of the problems in data analysis with Python. In this brief introduction to the guide I will explain why I think this way:

In 1991, the Python language was created by a Dutch programmer named Guido van Rossum. The language was designed in a way that allows its basic capabilities to be extended by additional software packages.

In 2008, a new package called Pandas was released. Pandas added the ability to perform data analysis with Python.

Despite all the new capabilities that Pandas added to the Python ecosystem, it is still an extension of a programming language that was not originally intended for data analysis. Therefore, in my opinion, data analysis in Python is not recommended, because writing code in Python is a cumbersome process: many hours are wasted on debugging errors and searching for their solutions. An analysis that can be easily performed in SQL may turn into long, confusing lines of code. In addition to these difficulties, you will also have to master annoying concepts such as indexes and variable types that do not always work together and may create confusion and mistakes.

The main advantages of creating analyses in the Python language are the ability to easily write functions that perform the same operations on many tables, and the ability to create statistical models and machine-learning algorithms.

So despite my introduction, if you are still interested in learning to analyze data with the Python language, or the job market has forced you to work in this environment, I have prepared a guide that can help you create some basic analyses with Python.

Please note, the guide assumes that you have a basic knowledge of programming and understanding of data analysis in SQL.

In addition, it is important to know that the Python language changes frequently: new functions appear all the time, and sometimes even parts of the syntax change. Therefore, if you find a mistake you are welcome to contact us by email: [email protected]

How to practice data analysis with Python language

In order to work with the Python language you can install a work environment called Anaconda. This work environment includes the Python language, data analysis packages (like Pandas) and a tool for creating notebooks called Jupyter.

Notebooks are the recommended way to work with data analysis in Python.

There is also an option to practice Python online for free. On the StrataScratch website you can find a Python interface with tables and questions from job interviews that can be practiced for free at the basic levels (proper disclosure: registration to the site through the link rewards me a small fee).

General notes on the Python language

It is important to know that:

  • The Python language is case-sensitive.
  • With the help of the # mark it is possible to write comments (like the -- mark in SQL).
  • As opposed to SQL, the Python language is very sensitive to tabs and white spaces. If your indentation is inconsistent you will get an error.
  • The “print” command can output almost any variable.
  • Each output can be inserted into a new variable and manipulated.
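The last two notes above can be sketched in a few lines (the variable names here are made up for the example):

```python
# "print" can output almost any variable
x = 5
y = x * 2          # the output of an expression stored in a new variable
print(y)           # prints 10

message = "hello"  # Python is case-sensitive: "Message" would be a different name
print(message)
```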

Importing packages into the Python environment

In order to use data analysis packages we need to import them. The import operation is simply called "import", and you should write it like this at the head of your code:

import numpy as np
import pandas as pd

I gave the "pandas" package, which handles the data analysis, the alias "pd". I will use this alias whenever I need to use a function from "pandas".

DataFrame

The type of variable in which Pandas stores tables is called "DataFrame", so in Pandas the term DataFrame is equivalent to the term "table". The DataFrame fields are actually a collection of "Series" variables. This is important to know, because sometimes, when we want to work with only a single column of the DataFrame, we will need to use functions that handle the "Series" variable type.
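A minimal sketch of this distinction, using a small made-up table (the column names are assumptions for the example):

```python
import pandas as pd

# A small illustrative DataFrame
data = pd.DataFrame({"CustomerID": [1, 2, 3], "Country": ["IL", "US", "UK"]})

single_column = data["CustomerID"]   # a single column is a Series
small_table = data[["CustomerID"]]   # double brackets return a DataFrame

print(type(single_column))  # <class 'pandas.core.series.Series'>
print(type(small_table))    # <class 'pandas.core.frame.DataFrame'>
```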

Importing Data into Python

There are several ways to import tables into the Python environment. It is possible that when running the code things will not work smoothly and it will be necessary to adjust the basic commands.

Input CSV file

path = r"c:\data.csv"
data = pd.read_csv(path)

Explanation
The path variable contains the path of the CSV file we want to import. The next command tells Pandas to import the table from the CSV file and put it in the "data" variable.

The "data" variable can be given any name you want. For example, if you import a loans table you can write:

loans = pd.read_csv(path)

Import Excel file 

path = r"C:\data.xlsx"
data = pd.read_excel(path)

Retrieving data from a database

Every database works a bit differently, and Pandas may connect to each one in a different way. In this example I connect to MySQL and perform an SQL command that retrieves my table (this requires the SQLAlchemy package and a MySQL driver to be installed).

from sqlalchemy import create_engine

engine = create_engine("mysql://root:root@localhost:3306/ecommerce")

# root - user
# root - password
# ecommerce - the name of the default database

sql = """
select * from customers;
"""

sql_table = pd.read_sql_query(sql, engine)

Basic operations of DataFrame in Pandas

Table retrieval

To view the table we imported in the previous section, it is enough to write the table name and run:

data

If we want to see only the first 10 rows you can run:

data.head(10)

If we want to see only the last 10 rows you can run:

data.tail(10)

Retrieving fields from DataFrame

There are several ways to retrieve fields in Pandas’ DataFrame:

You can reference the field name using a dot, like in SQL:

data.CustomerID

or square brackets and quotation marks:

data[“CustomerID”]

If you want to select more than one variable you need to use double square brackets:

data[[“CustomerID”, “Country”]]

You can also select variables according to their position using the iloc function:

data.iloc[:,2:4]

Note on the output of the retrieval:

When using single square brackets, data["CustomerID"], or dot notation, data.CustomerID, the output is a "Series" data type. When using double square brackets, data[["CustomerID"]], or the iloc function, you get a DataFrame output.

This issue is important because operations can be performed on the output according to its variable type. Some functions can only work on “DataFrame” data types and others can only work on Series.

Using the "inplace" parameter

Using the “inplace” parameter in Pandas functions instructs Pandas to make a change to our DataFrame. If we do not add this parameter, the change will not take place and will only exist at the output of the function (see example in the next section).

Removing a field from the DataFrame

In the following example I create a new DataFrame called "a" that is identical to the data table but without the "CustomerID" field. In the "data" variable the "CustomerID" field will still exist.

a = data.drop('CustomerID', axis=1)

Similarly to the above example, I can change the "data" table itself using the parameter "inplace=True":

data.drop('CustomerID', axis=1, inplace=True)

Renaming a field

To rename a field in the DataFrame, the Pandas package uses a Python data structure called a "dictionary", and the renaming command looks like this:

renamed_table = data.rename(
    index=str,
    columns={
        "CustomerID": "CustomerID_new",
        "Country": "Country_new",
    }
)

Manipulation of fields and adding new fields in DataFrame

data["UnitPrice_with_discount"] = data["UnitPrice"] * 0.9

The UnitPrice_with_discount field is a new field in which I calculated the unit price with a 10% discount. Similarly, it is possible to perform manipulations with other mathematical functions or string manipulation.
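A short sketch of both kinds of manipulation, on a small made-up table (the column names and values here are assumptions for the example):

```python
import pandas as pd

# Hypothetical data for the example
data = pd.DataFrame({"Country": ["israel", "france"], "UnitPrice": [10.0, 20.0]})

# Numeric manipulation, as in the discount example above
data["UnitPrice_with_discount"] = data["UnitPrice"] * 0.9

# String manipulation is done through the .str accessor of a Series
data["Country_upper"] = data["Country"].str.upper()

print(data)
```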

Sort a DataFrame

Sorting in Pandas is carried out using the sort_values function:

data.sort_values('age')

The sort can also be in descending order:

data.sort_values('age', ascending=False)

Statistical analysis with Pandas 

Number of lines and variables

To get the number of rows and variables in a DataFrame you can use the “shape” property like this:

data.shape

The answer will be in the form of
(541909, 8)
which means that in the DataFrame there are 541,909 rows and 8 columns. 

Description of the variables types in the DataFrame 

The "info" command will return the data type of each variable in the DataFrame and how many non-null values it has.

data.info()

Statistical description of variables in a DataFrame

The "describe" function is very useful for a basic statistical description of the variables in a DataFrame.

data.describe(include='all')

You can add the parameter include='all' to get all the statistical metrics that the function allows.

Using statistical functions

If you want to get only one statistical value, and not a list of values as in the describe command, you can run the statistical functions on their own, as in the following example:

data.Quantity.mean()

The values ​​obtained can be entered into variables and manipulated like in the following examples:

mean = data.Quantity.mean()
median = data.Quantity.median()

if mean > median:
    print('mean is greater than median')
else:
    print('median is greater or equal to mean')

The Pandas package has many standard statistical functions (which are actually aggregate functions). In addition, it has a function for counting values called "count" and another function for counting unique values called "nunique", which acts like COUNT(DISTINCT ...) in SQL.
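A short sketch of these two counting functions, on a small made-up table (the column names and values are assumptions for the example):

```python
import pandas as pd

# Hypothetical table: CustomerID 1 appears twice, one Quantity is missing
data = pd.DataFrame({"CustomerID": [1, 1, 2], "Quantity": [5, None, 3]})

print(data["CustomerID"].count())    # 3 - counts non-missing values
print(data["CustomerID"].nunique())  # 2 - counts unique values, like COUNT(DISTINCT ...) in SQL
print(data["Quantity"].count())      # 2 - missing values are not counted
```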

Distribution of Categorical Values

Another useful function of Pandas that can help analyze categorical variable data is “value_counts”.

With this function you can generate a distribution of values ​​in a categorical variable:

data[‘Country’].value_counts()

Create a sub table

To create a sub-table in SQL we filter the rows we want to keep with the help of the WHERE clause. Similarly, in Pandas you can use the "loc" function.

The loc function allows us to filter rows by condition and select the fields we want to use:

data.loc[data["Country"] == 'Israel']
data.loc[data["Country"] == 'Israel', ["Country", "CustomerID"]]

Explanation:

In the first example I selected only the rows where the country is Israel. In the second example I repeated the same condition, and in addition to filtering the rows I selected only the fields Country and CustomerID.

Note: There is more than one way to create a sub-table in Pandas.
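For example, one common alternative is boolean indexing with single square brackets, which filters rows without loc (the table here is made up for the example):

```python
import pandas as pd

# Hypothetical data for the example
data = pd.DataFrame({"Country": ["Israel", "France"], "CustomerID": [1, 2]})

# Equivalent row filtering without loc
subset = data[data["Country"] == "Israel"]
print(subset)
```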


This article was written by Yuval Marnin.
If you have a need to hire a freelancer data analyst you may contact me at: [email protected]
