Handling missing values – Part 2
Hi ML Enthusiasts! In this tutorial, we will continue our discussion of how to handle missing data. Before giving this a read, please be sure to go through our previous article on the same topic, Handling missing values – Part 1.
Importing the libraries
The first step will be to import the libraries.
import numpy as np
import pandas as pd
Importing the data
Now, let’s import the data. I have already prepared the data as a csv file containing missing values. Let’s use the read_csv() function of pandas to read it; the output of read_csv() is a dataframe. In the code below, we will be storing the output of read_csv() in a variable “data”. From here on, data will be the dataframe we operate on.
data = pd.read_csv("Data.csv")
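Before doing anything else, it is worth previewing the first few rows; the exact values shown will of course depend on your Data.csv.
#Previewing the first few rows; NaN marks the missing entries
data.head(10)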
Handling missing values – fillna() function
As can be seen above, the dataframe has many NaNs, i.e., missing values.
Let’s first check how many missing values there are in the dataframe by applying the isnull() function on it and then applying the sum() function on the result. Please note that sum() treats True as 1, so it counts the missing entries in each column.
missing_values = data.isnull().sum()
missing_values
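Under the hood, isnull() returns a boolean dataframe of the same shape as data, with True wherever a cell is missing; sum() then adds these up column by column.
#Boolean mask of missing cells; True means the cell is NaN
data.isnull().head()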
Finding percentage of missing values
Looking at the count above, we can see that we have missing values in almost every column. Now, let’s find out the percentage of missing values in our dataset.
total_rows = data.shape[0]
total_columns = data.shape[1]
total_cells = total_rows * total_columns
#Total number of cells in the original dataframe
total_cells
total_missing_values = missing_values.sum()
total_missing_values
percent_of_missing_values = round((total_missing_values/total_cells)*100, 2)
percent_of_missing_values
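For reference, the same percentage can be computed in one line using DataFrame.size, which equals rows × columns:
#Equivalent one-liner: total NaNs divided by total cells
round(data.isnull().sum().sum() / data.size * 100, 2)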
Thus, 23.17% of the data contains missing values.
Let’s now discuss how to use fillna() function to handle this 23.17% of data.
Methods for handling missing values
1. Replacing with 0
Though not a great option, one simple approach is to replace all missing values with 0.
data.fillna(0).head(10)
As can be seen above, all missing values got replaced with 0.
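fillna() also accepts a dictionary if you want a different fill value per column. A minimal sketch, assuming your dataframe has columns named A and B as in the outputs shown later in this tutorial:
#Filling column A with 0 and column B with -1; other columns stay NaN
data.fillna({'A': 0, 'B': -1}).head(10)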
2. Forward fill method
Original dataframe
data.head(10)
Dataframe after ffill method
data.fillna(method = 'ffill').head(10)
If you look closely at column B in both the original and the forward-filled dataframe, you’ll see that in the original dataframe column B has NaN in the 3rd row, and in the forward-filled dataframe it got replaced with 3.939059, the value just before it. The only limitation of the forward fill method is that a NaN in the very first row remains as it is – there is no previous value to copy, so it doesn’t get replaced.
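A note for readers on recent pandas versions: the method= argument of fillna() has been deprecated there, and the same forward fill is written with the dedicated ffill() method (bfill() is the backward counterpart used in the next section).
#Forward fill with the newer dedicated method
data.ffill().head(10)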
3. Backward fill method
Original dataframe
data.head(10)
Dataframe after bfill method
data.fillna(method = 'bfill').head(10)
In this case too, the limitation is that a NaN in the very last row doesn’t get replaced – there is no later value to copy. Here, the 10th row has all its cells filled only because the dataframe has more than 10 rows and the 11th row contains non-NaN values.
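If you want to cap how far a fill propagates, both fill methods accept a limit argument. For example, to replace at most one consecutive NaN per gap:
#Backward fill, but replace at most 1 consecutive NaN in each gap
data.fillna(method = 'bfill', limit = 1).head(10)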
If you want the filling to be done column-wise (across each row), you will have to specify axis = 1; axis = 0 is the default, in which case the filling propagates down the rows within each column.
Original dataframe
data.head(10)
Column-wise ffill method
data.fillna(method='ffill', axis = 1).head(10)
If you look closely at the first row, you can see that the last NaN values got replaced with -0.011685, the value just before the NaN values in column I. The same is true for the NaN cells of the first row in columns B and C; they got replaced with the value in column A, i.e., -0.487497.
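The backward analogue works the same way along rows; with axis = 1, bfill pulls each NaN from the next non-NaN column to its right.
#Column-wise backward fill: each NaN takes the value from the next column to its right
data.fillna(method = 'bfill', axis = 1).head(10)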
4. By using SimpleImputer() from sklearn.impute
data.head(10)
from sklearn.impute import SimpleImputer
#Initializing an object of the SimpleImputer class
data_Imputer = SimpleImputer()
#Fitting and transforming the data with the data_Imputer object, then wrapping the result in pd.DataFrame to convert the numpy array back into a dataframe (keeping the original column names)
imputedData = pd.DataFrame(data_Imputer.fit_transform(data), columns = data.columns)
imputedData.head(10)
The default imputation strategy for SimpleImputer is the mean. That is, each NaN value in column A is filled with the mean of the non-NaN data points in column A.
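Since all the columns in this tutorial’s dataframe are numeric, you can cross-check the result against plain pandas, where filling each column with its own mean gives the same values:
#Equivalent mean imputation in plain pandas
data.fillna(data.mean()).head(10)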
There are a few points to be noted here. When imputing missing values in continuous quantitative data, we should look at the histogram of the specific column.
- If the distribution comes out to be normal (approximately a bell curve), we should impute missing values with the mean.
- If the distribution comes out to be skewed, we should impute with the median, since the median is robust to outliers.
- For categorical data, we should impute missing values with the mode, i.e., the most frequent value (see the sketch below).
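Here is a minimal sketch of how these choices map onto SimpleImputer; the strategy names are part of sklearn’s API, but which one you pick for which column is your judgment call.
#Mean imputation (the default) – for roughly normal columns
mean_imputer = SimpleImputer(strategy = 'mean')
#Median imputation – for skewed columns
median_imputer = SimpleImputer(strategy = 'median')
#Mode imputation – for categorical columns
mode_imputer = SimpleImputer(strategy = 'most_frequent')
Each of these is fit and applied exactly like the default imputer above, via fit_transform().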
We will learn more about this when we study statistics in the coming few days.
So guys, with this we conclude our tutorial on handling missing values. In the next part of this Handling missing data series, we will dive a little deeper and use statistics to perform missing value imputation.
Also, have a look at our YouTube channel – Like, comment and Subscribe! Stay tuned!