What is Python Pandas?

Python Pandas is an open-source data science library for the Python programming language, used for data analysis and manipulation.

 

 

How to install Python Pandas?

To install Python Pandas we will use pip, the Python package manager, by running pip install pandas in a terminal.

 

 

Python Pandas lets its users work with data frames easily and efficiently. A DataFrame can be likened to an Excel sheet: a table of data arranged in rows and columns.
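To make the spreadsheet comparison concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The values are made up for illustration, and the name people is hypothetical; the column names simply mirror the dataset used later in this tutorial.

```python
import pandas as pd

# Hypothetical sample values, invented for this illustration only.
people = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "first_name": ["Ana", "Ben", "Cy"],
        "app_version": [1.1, 1.8, 1.5],
    }
)

print(people.shape)                 # (number of rows, number of columns)
print(list(people.columns))         # column labels, like spreadsheet headers
print(people.loc[0, "first_name"])  # a single cell: row label 0, column first_name
```

Each dictionary key becomes a column and each list position becomes a row, much like filling in a sheet column by column.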

 


 

Data analysis using Python Pandas

Using Python Pandas we can easily import files in CSV format, run some code on them to analyze the data, and then export the results. There are several options for running code when working with data, and the most common one is a Jupyter notebook.

It is also possible to use notebooks within VS Code. Note, however, that all of this code works just as well in a normal Python script.

Notebooks are common among data scientists because they make it easy to run each command separately. We will start off by importing Pandas into the workspace under an alias.

 

 

By convention, Python Pandas is imported under the alias pd (import pandas as pd) so that we do not have to type the word pandas every time we need to use it. We can now import a CSV file into the workspace so that we can work on it.

The file that we are importing below has 1,000 rows of ‘fake’ or random data. Whenever we import a file, it is good practice to give the data frame a name; in this case we will name it ‘df’.

However, when writing code for a production environment, it should be given a more descriptive name. We then write an equals sign followed by a call to pd.read_csv(‘ ’), with the path to the CSV file between the quotes.
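The import and the read step can be sketched as follows. The tutorial's CSV file is not available here, so this example assumes a stand-in: three made-up rows with the same column names, read from an in-memory string instead of a file on disk.

```python
import io

import pandas as pd

# Hypothetical stand-in for the tutorial's CSV file: three invented rows
# that use the same column names as the real dataset.
csv_text = """id,first_name,last_name,email,ip_address,app_version
1,Ana,Lee,ana@example.com,10.0.0.1,1.1
2,Ben,Ray,ben@example.com,10.0.0.2,1.8
3,Cy,Fox,cy@example.com,10.0.0.3,1.5
"""

# read_csv accepts a file path or any file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)
```

With a real file, the call would instead take its path, for example pd.read_csv("data.csv"), where data.csv is a placeholder name.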

 

 

This reads the data from the CSV file and stores it in the data frame that we have named df. Now that all the information is stored in our data frame, we can start extracting information about the dataset. To do that we use the code below:

 

Code
df.info()
Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   id           1000 non-null   int64
 1   first_name   1000 non-null   object
 2   last_name    1000 non-null   object
 3   email        1000 non-null   object
 4   ip_address   1000 non-null   object
 5   app_version  1000 non-null   float64
dtypes: float64(1), int64(1), object(4)
memory usage: 47.0+ KB

 

As the output above shows, df.info() summarizes the essential characteristics of the data: the number of rows and columns, and the data type of each column. Here id is an integer, app_version is a floating-point number, and the remaining four columns have the object type, which Pandas uses for text.
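The same characteristics can also be read one at a time, which is handy inside a script. This is a minimal sketch on a small made-up data frame (the values are assumptions, not the tutorial's data), using the shape and dtypes attributes and the isna() method.

```python
import pandas as pd

# Hypothetical three-row sample standing in for the tutorial's data frame.
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "first_name": ["Ana", "Ben", "Cy"],
        "app_version": [1.1, 1.8, 1.5],
    }
)

rows, cols = df.shape           # number of rows and columns
print(rows, cols)

print(df.dtypes)                # data type of each column

missing_per_column = df.isna().sum()  # count of missing values per column
print(missing_per_column)
```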

 


 

We can also explore the structure of the data and get to know what the data actually looks like by using the code below.

 

Code
df.head()
Output
  
   id  first_name  last_name  email                       ip_address       app_version
0  1   Paulina     Perello    pperello0@wordpress.com     139.93.249.108   1.1
1  2   Bucky       Murcutt    bmurcutt1@google.com.hk     237.128.217.183  1.8
2  3   Ibby        Gout       igout2@exblog.jp            77.43.9.164      1.5
3  4   Bernadina   Belvin     bbelvin3@yahoo.co.jp        182.13.149.38    1.4
4  5   Ruperta     Grimsdale  rgrimsdale4@shareasale.com  172.78.182.216   1.4

 

We can optionally pass the number of rows that we want returned. In the case above we did not pass a number, so the default of five rows is returned. This gives us a quick check of what the data looks like.

We can also look at the last rows of the data in the same way.

 

Code
df.tail()
Output
     id    first_name  last_name  email                         ip_address      app_version
995  996   Ainslie     Windebank  awindebankrn@4shared.com      25.206.138.45   1.4
996  997   Nevins      Whetland   nwhetlandro@shutterfly.com    88.106.2.5      1.0
997  998   Ira         Bardsley   ibardsleyrp@nih.gov           115.212.22.253  1.5
998  999   Samaria     Brettelle  sbrettellerq@kickstarter.com  167.190.74.169  1.2
999  1000  Seward      McKernon   smckernonrr@dailymail.co.uk   116.158.84.20   1.8
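Both head() and tail() accept an optional row count. A small sketch, assuming a made-up ten-row data frame rather than the tutorial's file:

```python
import pandas as pd

# Hypothetical ten-row data frame with ids 1 through 10.
df = pd.DataFrame({"id": range(1, 11)})

first_three = df.head(3)  # first three rows
last_two = df.tail(2)     # last two rows
default_rows = df.head()  # no argument: five rows

print(first_three)
print(last_two)
print(len(default_rows))
```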

Creating a Pivot Table

Creating a pivot table is a relatively simple task in Excel. However, if we need to do this every day or every week, we could write a Python script that does it for us: it reads the input file and exports the finished pivot table.

Using a pivot table, we want to find out how the 1,000 people in our dataset are spread across the different versions of this app. To create a pivot table in Pandas, we create a new data frame, which we will name ‘pivot_df’, and assign it the result of ‘pd.pivot_table’.

 


 

Now we can start specifying the information that goes into our pivot table. The first thing to pass is the main data frame holding all the information, which here is df, followed by a comma.

Next we tell it which index values go down the left-hand side of the pivot table. These correspond to the row labels of an Excel pivot table: the values we look the data up against.

The index argument takes a list because a pivot table can have multiple indexes. We pass app_version because we want to know, for each app version, how many people use it.

If we run this code now, Pandas assumes that we want to aggregate the id column, which will not always be the case.

 

Code
pivot_df = pd.pivot_table(df, index=['app_version'])

pivot_df

Output
                     id
app_version
1.0          507.173077
1.1          490.693878
1.2          517.680000
1.3          483.259615
1.4          490.696078
1.5          492.500000
1.6          493.488095
1.7          500.255319
1.8          551.635514
1.9          476.666667
2.0          486.187500

 

Pandas picked the id column, which happens to be the right column here but might not be in other scenarios. The numbers themselves are not what we want, though: by default pivot_table computes the mean, so we are seeing the average id for each app version, which is of no use to us.

Therefore we need to pass in more arguments. For example, we can state explicitly that the values we want come from the id column.

 

Code
pivot_df = pd.pivot_table(df, index=['app_version'], values=['id'])

pivot_df

Output
                     id
app_version
1.0          507.173077
1.1          490.693878
1.2          517.680000
1.3          483.259615
1.4          490.696078
1.5          492.500000
1.6          493.488095
1.7          500.255319
1.8          551.635514
1.9          476.666667
2.0          486.187500
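The numbers in this output are averages because pivot_table aggregates with the mean unless told otherwise. A minimal sketch, on a made-up six-row data frame, that compares the pivot table against an explicit group-by mean:

```python
import pandas as pd

# Hypothetical six-row sample; the real dataset has 1,000 rows.
df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5, 6],
        "app_version": [1.0, 1.0, 1.1, 1.1, 1.1, 1.2],
    }
)

# No aggfunc given, so pivot_table falls back to the mean.
pivot_df = pd.pivot_table(df, index=["app_version"], values=["id"])

# The same calculation written as an explicit group-by.
means = df.groupby("app_version")["id"].mean()

same_values = bool((pivot_df["id"] == means).all())
print(pivot_df)
print(same_values)
```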

 

Again we get the same data back, because Pandas had already assumed the ‘id’ column; we have now simply stated it explicitly. Since we want the count of the individual ids, not their sum or mean, we are going to use the ‘aggfunc’ argument.

 

Code
pivot_df = pd.pivot_table(df, index=['app_version'], values=['id'], aggfunc=['count'])

pivot_df

Output
             count
                id
app_version
1.0             52
1.1             98
1.2            125
1.3            104
1.4            102
1.5             96
1.6             84
1.7             94
1.8            107
1.9             90
2.0             48

 

In this case, we get whole numbers: the number of ids on each version. In Excel we would obtain a pivot table with the app_version and the count of ‘id’, which is exactly what we have obtained here. This is a little bit of data analysis with Python, and we now have a pivot table showing how many users are using each version of the app.
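As a cross-check, the same counts can be obtained with value_counts(). The sketch below uses a made-up six-row data frame and passes aggfunc as a plain string rather than a list, which gives a single-level column header; both choices are assumptions made for illustration, not part of the original walkthrough.

```python
import pandas as pd

# Hypothetical six-row sample; the real dataset has 1,000 rows.
df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5, 6],
        "app_version": [1.0, 1.0, 1.1, 1.1, 1.1, 1.2],
    }
)

# aggfunc as a string keeps the columns flat: just "id".
pivot_df = pd.pivot_table(df, index=["app_version"], values=["id"], aggfunc="count")

# Counting rows per version directly, sorted by version number.
counts = df["app_version"].value_counts().sort_index()

print(pivot_df)
print(counts)
```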

Suppose that we want to share this output with colleagues. In that case, we have to export it. Fortunately, Python Pandas has a really simple way of doing that: we take the data frame that we want to export, in this case ‘pivot_df’, call pivot_df.to_csv() on it, and give it a file name, which in this case will be ‘results.csv’.

 

Code
pivot_df.to_csv('results.csv')

 

The file ‘results.csv’ will be created in the working directory that our code editor is set to.
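To confirm that the export worked, the file can be read straight back in. This sketch assumes a small hand-built pivot_df and writes to a temporary directory instead of the working directory, so that it leaves no files behind.

```python
import os
import tempfile

import pandas as pd

# Hypothetical pivot table with made-up counts, indexed by app_version.
pivot_df = pd.DataFrame(
    {"id": [2, 3, 1]},
    index=pd.Index([1.0, 1.1, 1.2], name="app_version"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "results.csv")

    # The index is written as the first column of the CSV file.
    pivot_df.to_csv(out_path)
    file_exists = os.path.exists(out_path)

    # Read it back, restoring app_version as the index.
    restored = pd.read_csv(out_path, index_col="app_version")

print(file_exists)
print(restored)
```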
