New Delhi, India

Introduction to Pandas

Hi ML Enthusiasts! Today, we will learn about Pandas, one of the most popular and powerful Python packages, and how it is used in the world of data science.

 

Dataframe and series

The pandas package is built on top of NumPy and provides efficient tools for manipulating dataframes.

A dataframe is a two-dimensional, table-like structure made of rows and columns, with row labels and column labels respectively. Each column is a Series. A dataframe can contain heterogeneous data: each Series can have a data type different from the other Series. A dataframe can also have missing data, duplicates, garbage values, etc., and pandas helps us in munging this data.
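
To make this concrete, here is a minimal sketch of a dataframe with mixed column types, one missing value, and one duplicated row. The column names and values are made up for illustration and do not come from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a text column, an integer column, and a float column
# that contains a missing value, plus one fully duplicated row.
houses = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Pune", "Pune"],
    "rooms": [2, 3, 4, 4],
    "area_sqft": [850.0, np.nan, 1200.0, 1200.0],
})

# Each column is a Series with its own data type
print(houses.dtypes)

# Count the missing values in each column
missing_per_column = houses.isna().sum()
print(missing_per_column)

# Drop rows that exactly repeat an earlier row
deduplicated = houses.drop_duplicates()
print(deduplicated)
```

Handling missing values and duplicates properly is a topic of its own; the calls above only show that pandas can detect them.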

Introduction to Pandas – Importing pandas

Now, let’s learn how to import pandas and what operations we can do with it.
In order to import pandas, we use the following command:

In [0]:
import pandas

Introduction to Pandas – Checking version

If you want to know the version of the pandas package you have installed, the following command will show it:

In [39]:
pandas.__version__
Out[39]:
'0.25.3'

Also, you’ll see that we call pandas many times in the code, and typing the full package name each time can be tedious. To avoid this, we use an alias. We define the alias pd for pandas (and, likewise, np for numpy) in the following way:

In [0]:
import pandas as pd
import numpy as np

Fetching documentation and namespace

To display the built-in documentation of the pandas package, the following command will help (the ? syntax works in IPython and Jupyter notebooks; in a plain Python session, use help(pd) instead):

In [0]:
pd?

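To see the list of built-in functions in the pandas namespace, type pd. in a notebook cell and press the Tab key. The completion menu lists every available name, and you can pick one of them, such as pd.melt:

In [42]:
#pd.<Tab>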
pd.melt
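
Tab completion only works in an interactive environment. As a rough equivalent for a plain Python script, the sketch below uses the built-in dir() function to list the pandas namespace; the variable names public_names and first_doc_line are our own.

```python
import pandas as pd

# Public names in the pandas namespace (names starting with "_" are skipped)
public_names = [name for name in dir(pd) if not name.startswith("_")]
print(len(public_names))
print(public_names[:10])

# First line of the docstring of one of those names, pd.melt
first_doc_line = pd.melt.__doc__.strip().splitlines()[0]
print(first_doc_line)
```

The exact number of names depends on the installed pandas version.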

A Series, i.e., a one-dimensional labeled array, can be created from a list as follows (the variable is named df here, although it holds a Series and not a dataframe):

In [43]:
df = pd.Series([0, 1, 2, 3, 4])
df
Out[43]:
0    0
1    1
2    2
3    3
4    4
dtype: int64

If we want to fetch the values in a series, we use the following command:

In [44]:
df.values
Out[44]:
array([0, 1, 2, 3, 4])
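
A series also carries an index alongside its values. As a small sketch (the names s and arr are our own), the index attribute returns the labels, and to_numpy() returns the data as a NumPy array, just like values does. The pandas documentation recommends to_numpy() in recent versions.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4])

# The labels of the series; a default RangeIndex here
print(s.index)

# The underlying data as a NumPy array
arr = s.to_numpy()
print(arr)
```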

To access a single value of df by its index, e.g., the value at index 1 (the second element, since indexing starts at 0), the following command will do the job:

In [45]:
df[1]
Out[45]:
1

To extract a part of a series, e.g., all values from index 1 through index 3, use the following command:

In [46]:
df[1:4]
Out[46]:
1    1
2    2
3    3
dtype: int64

In the slice 1:4, the start index 1 is included and the stop index 4 is excluded, so we extract all values at indices >= 1 and < 4. If we want to fetch the value at index 4 as well, we need to use one of the following commands:

In [47]:
df[1:5] #or
df[1:]
Out[47]:
1    1
2    2
3    3
4    4
dtype: int64
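
The include/exclude rule depends on whether we slice by position or by label. Here is a minimal sketch, assuming a series with string labels (our own example, not the df above): iloc slices by integer position and excludes the stop, while loc slices by label and includes both ends.

```python
import pandas as pd

# Hypothetical series with string labels instead of the default 0..4
s = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])

# Position-based slice: start included, stop excluded
by_position = s.iloc[1:4]
print(by_position)

# Label-based slice: both end labels included
by_label = s.loc["b":"d"]
print(by_label)
```

Both slices select the same three elements here, but they get there by different rules, so being explicit with iloc or loc avoids surprises.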

Dataframe from series

Now, let’s talk about how to create a dataframe from two or more series:

In [48]:
#Dictionary having keys as houses and values as no of rooms
dict1 = {"House A": 2, "House B": 3, "House C": 4}   

#Converting dictionary to series
NoOfRooms = pd.Series(dict1)   
NoOfRooms
Out[48]:
House A    2
House B    3
House C    4
dtype: int64
In [49]:
dict2 = {"House A": 5000000, "House B": 6000000, "House C": 7000000}
PriceInRupees = pd.Series(dict2)
PriceInRupees
Out[49]:
House A    5000000
House B    6000000
House C    7000000
dtype: int64
In [50]:
HousesData = pd.DataFrame({"No of rooms": NoOfRooms, "Price in Rupees": PriceInRupees})
HousesData
Out[50]:
         No of rooms  Price in Rupees
House A            2          5000000
House B            3          6000000
House C            4          7000000

Here, we formed two dictionaries, converted each of them to a pandas series, put both series into a new dictionary, and then converted that dictionary into a dataframe. To know the names of the indices and columns, we use the following commands:

In [51]:
HousesData.index
Out[51]:
Index(['House A', 'House B', 'House C'], dtype='object')
In [52]:
HousesData.columns
Out[52]:
Index(['No of rooms', 'Price in Rupees'], dtype='object')
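
When a dataframe is built from several series, pandas matches the values by index label. The sketch below assumes two series whose labels only partly overlap ("House D" is a made-up extra entry): the resulting dataframe has a row for every label, and the gaps are filled with NaN.

```python
import pandas as pd

# Hypothetical series: "House D" has a room count but no price,
# and "House C" has a price but no room count
rooms = pd.Series({"House A": 2, "House B": 3, "House D": 5})
price = pd.Series({"House A": 5000000, "House B": 6000000, "House C": 7000000})

# Rows are the union of both indexes; missing entries become NaN
combined = pd.DataFrame({"No of rooms": rooms, "Price in Rupees": price})
print(combined)
```

Because NaN is a float, both columns are stored as float64 in this case.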

Accessing rows and columns

Accessing a particular column/row can be done in the following way:

In [53]:
HousesData['No of rooms']
Out[53]:
House A    2
House B    3
House C    4
Name: No of rooms, dtype: int64
In [54]:
#loc helps us in fetching the row corresponding to the row name mentioned in the square brackets
HousesData.loc['House A']      
Out[54]:
No of rooms              2
Price in Rupees    5000000
Name: House A, dtype: int64
In [55]:
#To access multiple rows by row name, pass a list of row names
HousesData.loc[['House A', 'House B']]    
Out[55]:
         No of rooms  Price in Rupees
House A            2          5000000
House B            3          6000000
In [56]:
#iloc accesses rows by integer position (starting at 0); 1: selects the second row onward
HousesData.iloc[1:]              
Out[56]:
         No of rooms  Price in Rupees
House B            3          6000000
House C            4          7000000
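
Both loc and iloc also accept a column selector after the row selector. Here is a minimal sketch that rebuilds the houses dataframe (the variable names are our own) and picks out single cells and a partial column.

```python
import pandas as pd

# Rebuild the houses dataframe used in this tutorial
houses = pd.DataFrame(
    {"No of rooms": [2, 3, 4], "Price in Rupees": [5000000, 6000000, 7000000]},
    index=["House A", "House B", "House C"],
)

# Single cell by row label and column label
price_b = houses.loc["House B", "Price in Rupees"]
print(price_b)

# The same cell by integer position: second row, second column
price_b_by_position = houses.iloc[1, 1]
print(price_b_by_position)

# Several rows and one column by label; the result is a Series
rooms_ab = houses.loc[["House A", "House B"], "No of rooms"]
print(rooms_ab)
```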

Preparing a dataframe with a single series can be done as follows:

In [57]:
pd.DataFrame(NoOfRooms, columns = ['No of rooms'])
Out[57]:
         No of rooms
House A            2
House B            3
House C            4
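
Another way to get the same one-column dataframe is the to_frame method of a series. This is a sketch of an alternative, not the approach used in the cell above; the variable names are our own.

```python
import pandas as pd

rooms = pd.Series({"House A": 2, "House B": 3, "House C": 4})

# to_frame turns the series into a one-column dataframe;
# the name argument sets the column label
rooms_frame = rooms.to_frame(name="No of rooms")
print(rooms_frame)
```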

Dataframe from array

A dataframe can be made from a structured NumPy array, whose named fields (X and Y below) become the columns, as follows:

In [58]:
onesArray = np.ones(3, dtype = [('X', 'i8'), ('Y', 'f8')])
onesArray
Out[58]:
array([(1, 1.), (1, 1.), (1, 1.)], dtype=[('X', '<i8'), ('Y', '<f8')])
In [59]:
pd.DataFrame(onesArray)
Out[59]:
   X    Y
0  1  1.0
1  1  1.0
2  1  1.0
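
A plain two-dimensional NumPy array works too, but it has no field names, so we pass the column labels ourselves. This is a minimal sketch with made-up values; the labels X and Y are reused only for illustration.

```python
import numpy as np
import pandas as pd

# A plain 3x2 array: no field names, and every value shares one dtype
plain_array = np.arange(6).reshape(3, 2)

# Column labels are passed explicitly through the columns argument
frame = pd.DataFrame(plain_array, columns=["X", "Y"])
print(frame)
```

Unlike the structured array above, all columns here share a single data type, because a plain array stores only one.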

Link to Introduction to Pandas video tutorial

Here is the link to the YouTube video for this blog post Introduction to Pandas.

So guys, with this we conclude this tutorial on pandas. In the next tutorial, we will learn how to handle missing values using Python (pandas and NumPy).

 
