Introduction to Pandas
Hi ML Enthusiasts! Today, we will be learning about one of the most popular and powerful packages of Python, pandas, and its usage in the world of data science.
Dataframe and series
The pandas package is built on top of NumPy and provides efficient tools for manipulating dataframes.
A dataframe is a two-dimensional, labelled data structure arranged in rows and columns, with row labels (the index) and column labels; each column is a Series. A dataframe can contain heterogeneous data: each Series can have a data type different from the other Series. A dataframe can also have missing data, duplicates, garbage values, etc., and pandas helps us in munging this data.
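To make the description above concrete, here is a minimal sketch of a dataframe with mixed column types and one missing value. The column names and values are made up for illustration and are not part of this tutorial's data.

```python
import pandas as pd

# Hypothetical example data: one text column, one integer column,
# and one float column that contains a missing value (None becomes NaN)
people = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meera"],
    "age": [31, 45, 28],
    "height_m": [1.62, 1.75, None],
})

# Each column is a Series with its own data type
print(people.dtypes)
print(type(people["age"]))

# Count the missing values in the height column
missing_count = int(people["height_m"].isna().sum())
print(missing_count)
```

Each column keeps its own dtype, and the missing height is stored as NaN instead of raising an error.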
Introduction to Pandas – Importing pandas
Now, let’s learn how to import pandas and what operations we can do with it.
In order to import pandas, we use the following command:
import pandas
Introduction to Pandas – Checking version
If you want to know the version of the pandas package you have installed, use the following command:
pandas.__version__
Also, we will be calling pandas many times in the code, and typing the full package name every time can be tedious. To avoid this, we use an alias. We define the alias pd for pandas (and np for NumPy) in the following way:
import pandas as pd
import numpy as np
Fetching documentation and namespace
In order to display the built-in documentation of the pandas package in IPython or a Jupyter notebook, the following command will help:
pd?
To see the list of built-in functions in the pandas namespace, type pd. in IPython or Jupyter and press the Tab key. The completion menu then lists the available names, such as the one shown below:
#pd.<TAB>
pd.melt
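The pd? and Tab-completion shortcuts above are IPython/Jupyter features. In a plain Python session, a rough equivalent uses the built-in dir and help functions. The variable names below are only illustrative.

```python
import pandas as pd

# Public names exposed by the pandas namespace (private ones start with "_")
public_names = [name for name in dir(pd) if not name.startswith("_")]

print(len(public_names))        # how many public names are available
print("melt" in public_names)   # pd.melt is one of them

# First line of the docstring for pd.melt;
# help(pd.melt) would print the full documentation
first_doc_line = pd.melt.__doc__.strip().splitlines()[0]
print(first_doc_line)
```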
A series can be created from a Python list as follows:
df = pd.Series([0, 1, 2, 3, 4])
df
If we want to fetch the values in a series, we use the following command:
df.values
In order to access the value of df at a particular index, e.g., the value at index 1 (the second element, since indexing starts at 0), the following command will do the job:
df[1]
To extract a part of the series, i.e., all values from index 1 to index 3, use the following command:
df[1:4]
The value at index 4 got excluded and the value at index 1 was included, i.e., in 1:4, the start 1 is included and the stop 4 is excluded. We are extracting the values at all positions >=1 and <4. If we want to fetch the value at index 4 as well, we need to use either of the following commands:
df[1:5] #or
df[1:]
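The indexing and slicing steps above can be collected into one runnable sketch. The series is called s here instead of df to avoid confusion with a dataframe; that name and the other variable names are illustrative choices.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4])

values = s.values   # the underlying NumPy array
second = s[1]       # with the default index, label 1 is the second element

middle = s[1:4]     # positions 1, 2, 3 (the stop position 4 is excluded)
tail_a = s[1:5]     # positions 1 to 4
tail_b = s[1:]      # same result: everything from position 1 onwards

print(list(middle))
print(list(tail_a))
print(tail_a.equals(tail_b))
```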
Dataframe from series
Now, let’s talk about how to create a dataframe from two or more series:
#Dictionary having keys as houses and values as no of rooms
dict1 = {"House A": 2, "House B": 3, "House C": 4}
#Converting dictionary to series
NoOfRooms = pd.Series(dict1)
NoOfRooms
dict2 = {"House A": 5000000, "House B": 6000000, "House C": 7000000}
PriceInRupees = pd.Series(dict2)
PriceInRupees
HousesData = pd.DataFrame({"No of rooms": NoOfRooms, "Price in Rupees": PriceInRupees})
HousesData
Here, we formed two dictionaries, converted each of them to a pandas series, made a dictionary of these two series, and then converted that dictionary into a dataframe. The dictionary keys (the house names) become the row labels, and the keys of the outer dictionary become the column labels. To know the names of the indices and columns, we use the following commands:
HousesData.index
HousesData.columns
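As a compact recap of the steps above, the sketch below rebuilds the same dataframe and inspects its labels and shape. The snake_case variable names are illustrative and differ from the names used in this post.

```python
import pandas as pd

# Two series that share the same row labels (the dictionary keys)
rooms = pd.Series({"House A": 2, "House B": 3, "House C": 4})
price = pd.Series({"House A": 5000000, "House B": 6000000, "House C": 7000000})

# The keys of this outer dictionary become the column labels
houses = pd.DataFrame({"No of rooms": rooms, "Price in Rupees": price})

row_labels = list(houses.index)
column_labels = list(houses.columns)

print(row_labels)
print(column_labels)
print(houses.shape)   # (number of rows, number of columns)
```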
Accessing rows and columns
Accessing a particular column/row can be done in the following way:
HousesData['No of rooms']
#loc helps us in fetching the row corresponding to the row name mentioned in the square brackets
HousesData.loc['House A']
#If you want to access multiple rows by their row names, pass a list of row names
HousesData.loc[['House A', 'House B']]
#iloc is used when we want to access rows by their integer positions (starting at 0)
HousesData.iloc[1:]
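The difference between label-based and position-based access is easiest to see side by side. The sketch below rebuilds the houses data under illustrative variable names and applies each accessor from this section. It also shows one detail not covered above: a label slice with loc includes its end label, unlike a positional slice.

```python
import pandas as pd

houses = pd.DataFrame({
    "No of rooms": pd.Series({"House A": 2, "House B": 3, "House C": 4}),
    "Price in Rupees": pd.Series(
        {"House A": 5000000, "House B": 6000000, "House C": 7000000}
    ),
})

rooms_column = houses["No of rooms"]               # one column, as a Series
row_a = houses.loc["House A"]                      # one row by label, as a Series
two_rows = houses.loc[["House A", "House B"]]      # several rows by label
from_second = houses.iloc[1:]                      # rows from position 1 onwards
label_slice = houses.loc["House A":"House B"]      # label slice includes "House B"

print(row_a["No of rooms"])
print(list(two_rows.index))
print(list(from_second.index))
print(list(label_slice.index))
```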
Preparing a dataframe with a single series can be done as follows:
pd.DataFrame(NoOfRooms, columns = ['No of rooms'])
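Another way to get the same one-column dataframe is the Series.to_frame method. The sketch below compares the two approaches for an unnamed series like the one in this post; the variable names are illustrative.

```python
import pandas as pd

rooms = pd.Series({"House A": 2, "House B": 3, "House C": 4})

# Approach used in this post: pass the (unnamed) series and a column name
frame_a = pd.DataFrame(rooms, columns=["No of rooms"])

# Alternative: convert the series directly and name the column
frame_b = rooms.to_frame(name="No of rooms")

print(frame_a)
print(list(frame_b.columns))
```

Both keep the house names as row labels; to_frame makes the intent a little more explicit.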
Dataframe from array
A dataframe can be made from a NumPy structured array (an array whose dtype has named fields) as follows:
onesArray = np.ones(3, dtype = [('X', 'i8'), ('Y', 'f8')])
onesArray
pd.DataFrame(onesArray)
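To see what the conversion above produces, the following sketch repeats it and inspects the result. The field names of the structured array become the column labels, and each field keeps its own data type. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Structured array with two named fields: X (64-bit int) and Y (64-bit float)
ones_array = np.ones(3, dtype=[("X", "i8"), ("Y", "f8")])

frame = pd.DataFrame(ones_array)

print(list(frame.columns))   # field names become column labels
print(frame.dtypes)          # each column keeps the dtype of its field
print(frame.shape)
```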
Link to Introduction to Pandas video tutorial
Here is the link to the YouTube video for this blog post Introduction to Pandas.
So guys, with this we conclude this tutorial on pandas. In the next tutorial, we will be learning how to handle missing values using Python (pandas and NumPy).