How to use Python Pandas – with example


What is the Python Pandas library?

Pandas is an open-source Python library for data analysis and manipulation. To install it, we use ‘pip’, Python’s package manager.

 

pip install pandas

 

Pandas lets its users work with DataFrames easily and efficiently. A DataFrame can be likened to an Excel sheet: data organized in rows and columns.
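To make the spreadsheet comparison concrete, here is a minimal sketch that builds a small DataFrame by hand. The names and values are hypothetical and are not taken from the tutorial’s CSV file.

```python
import pandas as pd

# Hypothetical sample rows, invented for illustration only.
people = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "first_name": ["Ada", "Grace", "Alan"],
        "app_version": [1.1, 1.8, 1.1],
    }
)

print(people.shape)                 # (number of rows, number of columns)
print(list(people.columns))         # column labels, like spreadsheet headers
print(people.loc[0, "first_name"])  # one cell, addressed by row label and column name
```

Each dictionary key becomes a column and each list position becomes a row, which is roughly how a sheet with a header row is laid out.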

 

 

Data analysis using Python Pandas

With Pandas we can easily import a file in CSV format, run some code on it to analyze the data, and then export the result. There are several options for running code when working with data; one of the most common is a Jupyter notebook.

It is also possible to use notebooks within VS Code; note, however, that all of this code works just as well in a normal Python script. Notebooks are common among data scientists because they make it easy to run each command separately. We will start by importing pandas into the workspace under an alias.

 

import pandas as pd

 

By convention, pandas is imported under the alias pd, so that we don’t have to type the word pandas every time we use it. We can now import a CSV file into the workspace so that we can work on it. The file that we are importing below has 1,000 rows of ‘fake’, randomly generated data.

 


 

When importing a file, we give the data frame a name; in this case we will name it ‘df’. In a production environment, however, it should be given a more descriptive name. We then write an equals sign followed by pd.read_csv(‘ ’), with the path to the CSV file between the quotes.

 

df = pd.read_csv(' ')

 

This reads the data from the CSV file and stores it in the data frame that we have named df. Now that all the information is stored in our data frame, we can start extracting information about the dataset. To do that we use the code below:

 

df.info()

Output:

RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           1000 non-null   int64  
 1   first_name   1000 non-null   object 
 2   last_name    1000 non-null   object 
 3   email        1000 non-null   object 
 4   ip_address   1000 non-null   object 
 5   app_version  1000 non-null   float64
dtypes: float64(1), int64(1), object(4)
memory usage: 47.0+ KB

 

As shown in the output above, this code gives us the essential characteristics of the data: the number of rows and columns, the number of non-null values in each column, and each column’s data type. Here id is an integer, app_version is a floating-point number, and the remaining four columns are stored as objects (text).
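Since the path to the tutorial’s file is not shown, the following sketch reads a tiny CSV from an in-memory string instead. The three rows are hypothetical stand-ins that only reuse the column names from the output above.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real file.
csv_text = """id,first_name,last_name,email,ip_address,app_version
1,Ada,Lovelace,ada@example.com,192.0.2.1,1.1
2,Grace,Hopper,grace@example.com,192.0.2.2,1.8
3,Alan,Turing,alan@example.com,192.0.2.3,1.1
"""

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # inferred type of each column
sample_df.info()         # prints the same kind of summary as above
```

With a real file we would pass its path to read_csv instead of the StringIO object; the rest of the calls stay the same.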

Reading the data from a CSV file with Python Pandas

We can also explore the structure of the data and get to know what the data actually looks like by using the code below. 

 

df.head()

Output:


	id	first_name	last_name	email	ip_address	app_version
0	1	Paulina	Perello	pperello0@wordpress.com	139.93.249.108	1.1
1	2	Bucky	Murcutt	bmurcutt1@google.com.hk	237.128.217.183	1.8
2	3	Ibby	Gout	igout2@exblog.jp	77.43.9.164	1.5
3	4	Bernadina	Belvin	bbelvin3@yahoo.co.jp	182.13.149.38	1.4
4	5	Ruperta	Grimsdale	rgrimsdale4@shareasale.com	172.78.182.216	1.4

 

Passing the number of rows to return is optional. In the case above we have not passed a number, so the default of five rows is returned. This gives us a quick check on what the data looks like. We can also look at the bottom of the data in the same way.

 

df.tail()

Output:

	id	first_name	last_name	email	ip_address	app_version
995	996	Ainslie	Windebank	awindebankrn@4shared.com	25.206.138.45	1.4
996	997	Nevins	Whetland	nwhetlandro@shutterfly.com	88.106.2.5	1.0
997	998	Ira	Bardsley	ibardsleyrp@nih.gov	115.212.22.253	1.5
998	999	Samaria	Brettelle	sbrettellerq@kickstarter.com	167.190.74.169	1.2
999	1000	Seward	McKernon	smckernonrr@dailymail.co.uk	116.158.84.20	1.8
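As a small sketch of the optional row count mentioned earlier, both head() and tail() accept the number of rows to return. The ten-row frame below is hypothetical and only exists to keep the example self-contained.

```python
import pandas as pd

# Hypothetical ten-row frame, invented for illustration.
numbers = pd.DataFrame({"id": range(1, 11)})

first_three = numbers.head(3)  # first 3 rows instead of the default 5
last_two = numbers.tail(2)     # last 2 rows

print(first_three["id"].tolist())  # [1, 2, 3]
print(last_two["id"].tolist())     # [9, 10]
print(len(numbers.head()))         # 5, the default
```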

 

Creating a Pivot Table with Python Pandas

Creating a pivot table is a relatively simple task in Excel. However, if we need to do this every day or every week, we could write a Python script that does it for us: it takes the input file and exports the pivot table. Using a pivot table, we intend to find out how the 1,000 people in our data are using the different versions of this app.

To create a pivot table in Pandas, we create a new data frame, which in this case we will name ‘pivot_df’, and assign it the result of ‘pd.pivot_table’. Now we can start to specify the information that we want in our pivot table. The first thing we need to give it is our main data frame with all the information in it.

 


 

The first argument, df, is the data frame that we intend to use, followed by a comma. We now need to tell it which columns form the index, that is, the row labels down the left-hand side of the pivot table. This is the same choice we would make for the row labels of a pivot table in Excel.

In this case index is set to a list, because we could have multiple index columns. We pass app_version because we want to know how many people use each version of the app. If we run this code now, we get a pivot table built on the assumption that we want the id column aggregated; however, this might not always be the case.

 

pivot_df = pd.pivot_table(df, index=['app_version'])
pivot_df

Output:

              id
app_version	

1.0	507.173077
1.1	490.693878
1.2	517.680000
1.3	483.259615
1.4	490.696078
1.5	492.500000
1.6	493.488095
1.7	500.255319
1.8	551.635514
1.9	476.666667
2.0	486.187500

 

Advanced formatting with Python Pandas

The id column is the one we want here, although in other scenarios it might not be. The values, however, are of no use to us: by default pivot_table aggregates with the mean, so each row shows the average id for that version. We therefore need to pass in more arguments. First, we can state explicitly that we want to aggregate the id column.

 

pivot_df = pd.pivot_table(df, index=['app_version'], values=['id'])
pivot_df

Output:


	id
app_version	

1.0	507.173077
1.1	490.693878
1.2	517.680000
1.3	483.259615
1.4	490.696078
1.5	492.500000
1.6	493.488095
1.7	500.255319
1.8	551.635514
1.9	476.666667
2.0	486.187500
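The decimal values in this output are averages of the id column. A minimal sketch with hypothetical data, small enough to check by hand, shows that pivot_table falls back to the mean when no aggfunc is given:

```python
import pandas as pd

# Hypothetical data: ids 1 and 3 use version 1.0, ids 2 and 6 use version 1.1.
usage = pd.DataFrame(
    {"id": [1, 2, 3, 6], "app_version": [1.0, 1.1, 1.0, 1.1]}
)

# No aggfunc is passed, so the default aggregation (the mean) is applied.
mean_pivot = pd.pivot_table(usage, index=["app_version"], values=["id"])
print(mean_pivot)

# Version 1.0 -> (1 + 3) / 2 = 2.0, version 1.1 -> (2 + 6) / 2 = 4.0
```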

 

Again we get the same data back, because pandas had already made that assumption; this time, though, the ‘id’ column is selected explicitly. Since we want the count of the individual ids, and not the sum or the mean, we are going to use the aggfunc parameter.

 

pivot_df = pd.pivot_table(df, index=['app_version'], values=['id'], aggfunc=['count'])
pivot_df

Output:

	count
	id
app_version	

1.0	52
1.1	98
1.2	125
1.3	104
1.4	102
1.5	96
1.6	84
1.7	94
1.8	107
1.9	90
2.0	48

 

In this case, we get whole numbers: the count of how many different ids are on each version. In Excel we would obtain a pivot table with app_version as the rows and the count of ‘id’ as the values, which is exactly what we have obtained here.

This is a little bit of data analysis with Python: we’ve got a pivot table showing how many users are using each version of the app. Suppose that we want to share this output with colleagues; in that case, we have to export it.

Fortunately, pandas has a really simple way of doing that. All we need to do is call to_csv() on the data frame that we want to export, which in this case is ‘pivot_df’, and give it a file name, which in this case will be ‘results.csv’.
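The two header rows (‘count’ above ‘id’) in the output come from passing aggfunc as a list. The sketch below, on hypothetical data, compares that with passing a single function name and cross-checks the counts with a plain groupby. It is one possible way to verify the result, not the only one.

```python
import pandas as pd

# Hypothetical data: three users on 1.0, one on 1.1, one on 1.2.
usage = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5], "app_version": [1.0, 1.1, 1.0, 1.2, 1.0]}
)

# A list of functions produces two-level column labels: ("count", "id").
nested = pd.pivot_table(usage, index=["app_version"], values=["id"], aggfunc=["count"])

# A single function name keeps the columns flat: just "id".
flat = pd.pivot_table(usage, index=["app_version"], values=["id"], aggfunc="count")

# Cross-check: the number of rows per group gives the same counts.
group_sizes = usage.groupby("app_version").size()

print(list(nested.columns))  # [('count', 'id')]
print(list(flat.columns))    # ['id']
print(flat["id"].tolist())   # [3, 1, 1]
```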

 

pivot_df.to_csv('results.csv')

 

The file ‘results.csv’ will be created in the working directory that our code editor is set to.
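Putting the steps together, the following is a sketch of the kind of reusable script mentioned earlier: read a CSV, count ids per app version, and export the result. The function name export_version_counts, the file name users.csv and the sample rows are hypothetical, and the demo writes to a temporary directory so that it does not touch any real files.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_version_counts(input_csv, output_csv):
    """Read user rows, count ids per app_version, and write the result."""
    df = pd.read_csv(input_csv)
    # A single function name (not a list) keeps one header row in the CSV.
    pivot_df = pd.pivot_table(df, index=["app_version"], values=["id"], aggfunc="count")
    pivot_df.to_csv(output_csv)  # the index (app_version) becomes the first column
    return pivot_df


# Demo in a temporary directory with hypothetical data.
with tempfile.TemporaryDirectory() as tmp:
    input_path = Path(tmp) / "users.csv"
    output_path = Path(tmp) / "results.csv"
    pd.DataFrame(
        {"id": [1, 2, 3], "app_version": [1.0, 1.1, 1.0]}
    ).to_csv(input_path, index=False)

    export_version_counts(input_path, output_path)
    exported_text = output_path.read_text()
    round_trip = pd.read_csv(output_path, index_col="app_version")

print(exported_text)
```

Passing index=False when writing the input keeps the row numbers out of that file, while the pivot export keeps its index because app_version is the information we want in the first column.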

 


 

Summary

This is how Python Pandas works in practice.
