Advertisements
New Delhi, India

Handling missing values – Part 2

Handling missing values – Part 2

Hi ML Enthusiasts! In this tutorial, we will be continuing our tutorial on how to handle missing data. Before giving this a read, please be sure to go through our previous article on the same topic, Handling missing values – Part 1.

In this part of handling missing values, we will be going little advanced and will be discussing different ways which can be used to deal with missing values. We will be talking about fillna() function and will be using it to discuss various methods of missing value imputation.

Importing the libraries

The first step will be to import the libraries.

In [0]:
import numpy as np
import pandas as pd

Importing the data

Now, let’s import the data. I have already prepared the data in the form of a csv file having missing values in it. Let’s use read_csv() function of pandas to read it. The output of read_csv() function is a dataframe. In the code below, we will storing the output of read_csv function in a variable “data”. After this, data will be our dataframe on which we will be operating.

In [3]:
data = pd.read_csv("Data.csv")

Handling missing values – Fillna() function

As can be seen from above, the dataframe has many NaN’s or missing values.
Let’s check first how many missing values are there in the dataframe by first applying isnull() function on it and then applying sum() function on it. Please note that sum() counts only True values corresponding to each column.

In [4]:
missing_values = data.isnull().sum()
missing_values
Out[4]:
A     0
B    30
C    30
D     2
E     2
F     2
G     2
H     2
I     0
J    30
K    30
dtype: int64

Finding percentage of missing values

Looking at the count above, we can see that we have missing values in almost every column. Now, let’s find out the percentage of missing values in our dataset.

In [7]:
 total_rows = data.shape[0]
 total_columns = data.shape[1]
 total_cells = total_rows * total_columns
 #Total cells in original dataframe is:
 total_cells
Out[7]:
561
In [8]:
total_missing_Values_in_each_column = missing_values.sum()
total_missing_Values_in_each_column
Out[8]:
130
In [10]:
percent_of_missing_values = round((total_missing_Values_in_each_column/total_cells)*100, 2)
percent_of_missing_values
Out[10]:
23.17

Thus, 23.17% of the data contains missing values.

Let’s now discuss how to use fillna() function to handle this 23.17% of data.

 

Handling missing values methods

1. Replacing with 0

Though not a great one, one option is to replace all missing values with 0.

In [12]:
data.fillna(0).head(10)
Out[12]:
A B C D E F G H I J K
0 -0.487497 0.000000 0.000000 0.176801 0.823199 0.188485 0.811515 0.823199 -0.011685 0.000000 0.000000
1 -0.832406 3.939059 -4.771465 0.188485 0.811515 0.209944 0.790056 0.811515 -0.021459 0.073096 -0.094555
2 0.315470 0.000000 0.000000 0.209944 0.790056 0.194976 0.805024 0.790056 0.014968 0.000000 0.000000
3 0.283631 -0.053466 0.337097 0.194976 0.805024 0.201061 0.798939 0.194976 0.006085 -0.001467 0.007551
4 0.473122 0.473122 0.000000 0.201061 0.798939 0.212307 0.787693 0.201061 0.011246 0.011246 0.000000
5 -0.414056 2.488650 -2.902706 0.212307 0.787693 0.211018 0.788982 0.212307 -0.001289 0.075263 -0.076551
6 0.072839 0.000000 0.000000 0.211018 0.788982 0.212731 0.787269 0.211018 0.001713 0.000000 0.000000
7 1.037462 1.065311 -0.027849 0.212731 0.787269 0.235407 0.764593 0.212731 0.022676 0.023817 -0.001140
8 0.131151 -0.165368 0.296519 0.235407 0.764593 0.242157 0.757843 0.235407 0.006750 -0.004869 0.011619
9 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

As can be seen above, all missing values got replaced with 0.

2. Forward fill method

In forward fill method, all the NaNs get replaced with the number just previous to them. We can apply this by passing ‘ffill’ argument to method parameter of fillna function.

Original dataframe

In [15]:
data.head(10)
Out[15]:
A B C D E F G H I J K
0 -0.487497 NaN NaN 0.176801 0.823199 0.188485 0.811515 0.823199 -0.011685 NaN NaN
1 -0.832406 3.939059 -4.771465 0.188485 0.811515 0.209944 0.790056 0.811515 -0.021459 0.073096 -0.094555
2 0.315470 NaN NaN 0.209944 0.790056 0.194976 0.805024 0.790056 0.014968 NaN NaN
3 0.283631 -0.053466 0.337097 0.194976 0.805024 0.201061 0.798939 0.194976 0.006085 -0.001467 0.007551
4 0.473122 0.473122 0.000000 0.201061 0.798939 0.212307 0.787693 0.201061 0.011246 0.011246 0.000000
5 -0.414056 2.488650 -2.902706 0.212307 0.787693 0.211018 0.788982 0.212307 -0.001289 0.075263 -0.076551
6 0.072839 NaN NaN 0.211018 0.788982 0.212731 0.787269 0.211018 0.001713 NaN NaN
7 1.037462 1.065311 -0.027849 0.212731 0.787269 0.235407 0.764593 0.212731 0.022676 0.023817 -0.001140
8 0.131151 -0.165368 0.296519 0.235407 0.764593 0.242157 0.757843 0.235407 0.006750 -0.004869 0.011619
9 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.000000 NaN NaN

Dataframe after ffill method

In [14]:
data.fillna(method = 'ffill').head(10)
Out[14]:
A B C D E F G H I J K
0 -0.487497 NaN NaN 0.176801 0.823199 0.188485 0.811515 0.823199 -0.011685 NaN NaN
1 -0.832406 3.939059 -4.771465 0.188485 0.811515 0.209944 0.790056 0.811515 -0.021459 0.073096 -0.094555
2 0.315470 3.939059 -4.771465 0.209944 0.790056 0.194976 0.805024 0.790056 0.014968 0.073096 -0.094555
3 0.283631 -0.053466 0.337097 0.194976 0.805024 0.201061 0.798939 0.194976 0.006085 -0.001467 0.007551
4 0.473122 0.473122 0.000000 0.201061 0.798939 0.212307 0.787693 0.201061 0.011246 0.011246 0.000000
5 -0.414056 2.488650 -2.902706 0.212307 0.787693 0.211018 0.788982 0.212307 -0.001289 0.075263 -0.076551
6 0.072839 2.488650 -2.902706 0.211018 0.788982 0.212731 0.787269 0.211018 0.001713 0.075263 -0.076551
7 1.037462 1.065311 -0.027849 0.212731 0.787269 0.235407 0.764593 0.212731 0.022676 0.023817 -0.001140
8 0.131151 -0.165368 0.296519 0.235407 0.764593 0.242157 0.757843 0.235407 0.006750 -0.004869 0.011619
9 0.000000 -0.165368 0.296519 0.235407 0.764593 0.242157 0.757843 0.235407 0.000000 -0.004869 0.011619

If you look at column B closely in original as well as forward filled dataframe, you’ll see in original dataframe, column B has NaN at 3rd row and in forward imputed dataframe, it got replaced with 3.939059, the value just previous to it. The only limitation with forward fill method is that the topmost or first NaN value remains as it is – it doesn’t get replaced.

3. Backward fill method

In this method, the NaN value gets replaced with the non-NaN value just next to it.

Original dataframe

In [16]:
data.head(10)
Out[16]:
A B C D E F G H I J K
0 -0.487497 NaN NaN 0.176801 0.823199 0.188485 0.811515 0.823199 -0.011685 NaN NaN
1 -0.832406 3.939059 -4.771465 0.188485 0.811515 0.209944 0.790056 0.811515 -0.021459 0.073096 -0.094555
2 0.315470 NaN NaN 0.209944 0.790056 0.194976 0.805024 0.790056 0.014968 NaN NaN
3 0.283631 -0.053466 0.337097 0.194976 0.805024 0.201061 0.798939 0.194976 0.006085 -0.001467 0.007551
4 0.473122 0.473122 0.000000 0.201061 0.798939 0.212307 0.787693 0.201061 0.011246 0.011246 0.000000
5 -0.414056 2.488650 -2.902706 0.212307 0.787693 0.211018 0.788982 0.212307 -0.001289 0.075263 -0.076551
6 0.072839 NaN NaN 0.211018 0.788982 0.212731 0.787269 0.211018 0.001713 NaN NaN
7 1.037462 1.065311 -0.027849 0.212731 0.787269 0.235407 0.764593 0.212731 0.022676 0.023817 -0.001140
8 0.131151 -0.165368 0.296519 0.235407 0.764593 0.242157 0.757843 0.235407 0.006750 -0.004869 0.011619
9 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.000000 NaN NaN

Dataframe after bfill method

In [17]:
data.fillna(method = 'bfill').head(10)
Out[17]:
A B C D E F G H I J K
0 -0.487497 3.939059 -4.771465 0.176801 0.823199 0.188485 0.811515 0.823199 -0.011685 0.073096 -0.094555
1 -0.832406 3.939059 -4.771465 0.188485 0.811515 0.209944 0.790056 0.811515 -0.021459 0.073096 -0.094555
2 0.315470 -0.053466 0.337097 0.209944 0.790056 0.194976 0.805024 0.790056 0.014968 -0.001467 0.007551
3 0.283631 -0.053466 0.337097 0.194976 0.805024 0.201061 0.798939 0.194976 0.006085 -0.001467 0.007551
4 0.473122 0.473122 0.000000 0.201061 0.798939 0.212307 0.787693 0.201061 0.011246 0.011246 0.000000
5 -0.414056 2.488650 -2.902706 0.212307 0.787693 0.211018 0.788982 0.212307 -0.001289 0.075263 -0.076551
6 0.072839 1.065311 -0.027849 0.211018 0.788982 0.212731 0.787269 0.211018 0.001713 0.023817 -0.001140
7 1.037462 1.065311 -0.027849 0.212731 0.787269 0.235407 0.764593 0.212731 0.022676 0.023817 -0.001140
8 0.131151 -0.165368 0.296519 0.235407 0.764593 0.242157 0.757843 0.235407 0.006750 -0.004869 0.011619
9 0.000000 1.220041 -1.607615 0.242157 0.757843 0.233609 0.766391 0.242157 0.000000 0.038459 -0.047007

In this case also, limitation is the last NaN value doesn’t get replaced. Here, the 10th row is having all filled cells only because the dataframe has more than 10 rows and the 11th row contains non-NaN values.
If you want the filling to be done column wise, you will have a specify the axis as 1 as axis = 0 by default and the filling takes row-wise in this case by default.

Original dataframe

In [18]:
data.head(10)
Out[18]:
A B C D E F G H I J K
0 -0.487497 NaN NaN 0.176801 0.823199 0.188485 0.811515 0.823199 -0.011685 NaN NaN
1 -0.832406 3.939059 -4.771465 0.188485 0.811515 0.209944 0.790056 0.811515 -0.021459 0.073096 -0.094555
2 0.315470 NaN NaN 0.209944 0.790056 0.194976 0.805024 0.790056 0.014968 NaN NaN
3 0.283631 -0.053466 0.337097 0.194976 0.805024 0.201061 0.798939 0.194976 0.006085 -0.001467 0.007551
4 0.473122 0.473122 0.000000 0.201061 0.798939 0.212307 0.787693 0.201061 0.011246 0.011246 0.000000
5 -0.414056 2.488650 -2.902706 0.212307 0.787693 0.211018 0.788982 0.212307 -0.001289 0.075263 -0.076551
6 0.072839 NaN NaN 0.211018 0.788982 0.212731 0.787269 0.211018 0.001713 NaN NaN
7 1.037462 1.065311 -0.027849 0.212731 0.787269 0.235407 0.764593 0.212731 0.022676 0.023817 -0.001140
8 0.131151 -0.165368 0.296519 0.235407 0.764593 0.242157 0.757843 0.235407 0.006750 -0.004869 0.011619
9 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.000000 NaN NaN

Column-wise bfill method

In [19]:
data.fillna(method='ffill', axis = 1).head(10)
Out[19]:
A B C D E F G H I J K
0 -0.487497 -0.487497 -0.487497 0.176801 0.823199 0.188485 0.811515 0.823199 -0.011685 -0.011685 -0.011685
1 -0.832406 3.939059 -4.771465 0.188485 0.811515 0.209944 0.790056 0.811515 -0.021459 0.073096 -0.094555
2 0.315470 0.315470 0.315470 0.209944 0.790056 0.194976 0.805024 0.790056 0.014968 0.014968 0.014968
3 0.283631 -0.053466 0.337097 0.194976 0.805024 0.201061 0.798939 0.194976 0.006085 -0.001467 0.007551
4 0.473122 0.473122 0.000000 0.201061 0.798939 0.212307 0.787693 0.201061 0.011246 0.011246 0.000000
5 -0.414056 2.488650 -2.902706 0.212307 0.787693 0.211018 0.788982 0.212307 -0.001289 0.075263 -0.076551
6 0.072839 0.072839 0.072839 0.211018 0.788982 0.212731 0.787269 0.211018 0.001713 0.001713 0.001713
7 1.037462 1.065311 -0.027849 0.212731 0.787269 0.235407 0.764593 0.212731 0.022676 0.023817 -0.001140
8 0.131151 -0.165368 0.296519 0.235407 0.764593 0.242157 0.757843 0.235407 0.006750 -0.004869 0.011619
9 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

If you closely look the first row, we can see that the last NaN values get replaced with -0.011685, the values just before the NaN values in column I. Same is the case with the NaN cells of first row in column B and C, theygot replaced with the value in column A, i.e., -0.487497

4. By using SimpleImputer() from sklearn.impute

For this, we will first import SimpleImputer class from sklearn.impute module.
In [23]:
data.head(10)
Out[23]:
A B C D E F G H I J K
0 -0.487497 NaN NaN 0.176801 0.823199 0.188485 0.811515 0.823199 -0.011685 NaN NaN
1 -0.832406 3.939059 -4.771465 0.188485 0.811515 0.209944 0.790056 0.811515 -0.021459 0.073096 -0.094555
2 0.315470 NaN NaN 0.209944 0.790056 0.194976 0.805024 0.790056 0.014968 NaN NaN
3 0.283631 -0.053466 0.337097 0.194976 0.805024 0.201061 0.798939 0.194976 0.006085 -0.001467 0.007551
4 0.473122 0.473122 0.000000 0.201061 0.798939 0.212307 0.787693 0.201061 0.011246 0.011246 0.000000
5 -0.414056 2.488650 -2.902706 0.212307 0.787693 0.211018 0.788982 0.212307 -0.001289 0.075263 -0.076551
6 0.072839 NaN NaN 0.211018 0.788982 0.212731 0.787269 0.211018 0.001713 NaN NaN
7 1.037462 1.065311 -0.027849 0.212731 0.787269 0.235407 0.764593 0.212731 0.022676 0.023817 -0.001140
8 0.131151 -0.165368 0.296519 0.235407 0.764593 0.242157 0.757843 0.235407 0.006750 -0.004869 0.011619
9 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.000000 NaN NaN
In [22]:
from sklearn.impute import SimpleImputer
#Inititalizing the object of SimpleImputer class
data_Imputer = SimpleImputer()
#Fitting and transforming the data using the data_Imputer object and then wrapping it around pd.DataFrame function to convert it into a dataframe from numpy array.
imputedData = pd.DataFrame(data_Imputer.fit_transform(data))
imputedData.head(10)
Out[22]:

The default value for this method is the mean value of imputation. That is, each NaN value of column A is filled with the mean of the non-NaN data points corresponding to column A.

There are few points to be noted here. When doing missing value imputation to continuous quantitative data, we should look at the histogram of a specific column.

  • If the distribution comes out to be normal (approximate to bell curve), we should do missing value imputation with mean.
  • If the distribution comes out to be skewed, we should do this with median.

For categorical data,

  • we should do missing value imputation with mode.

We will learn more about this when we’ll be studying statistics in the coming few days.

So guys, with this we conclude our tutorial on Handling missing values. In the next part of this Handling missing data series, we will dive little deeper and will use statistics to perform missing value imputation.

Also, have a look at our YouTube channel – Like, comment and Subscribe! Stay tuned!

Advertisements

One thought on “Handling missing values – Part 2

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: