Handling missing values – Part 2

Hi ML Enthusiasts! In this tutorial, we continue our discussion of how to handle missing data. Before reading it, please go through our previous article on the same topic, Handling missing values – Part 1.

In this part, we go a little more advanced and discuss different ways of dealing with missing values. We will look at the fillna() function and use it to walk through several methods of missing value imputation, followed by SimpleImputer from sklearn.impute.

Importing the libraries

The first step will be to import the libraries.

import numpy as np
import pandas as pd

Importing the data

Now, let’s import the data. I have already prepared a CSV file that contains missing values. Let’s use the read_csv() function of pandas to read it. read_csv() returns a dataframe, which we store in the variable “data” in the code below. From here on, data is the dataframe we operate on.

data = pd.read_csv("Data.csv")

Displaying the dataframe, for example with data.head(10), shows that it has many NaNs, i.e. missing values.
Let’s first check how many missing values each column has by applying the isnull() function to the dataframe and then applying sum() to the result. isnull() marks each missing cell as True, and sum() counts the True values in each column.

missing_values = data.isnull().sum()
missing_values

A     0
B    30
C    30
D     2
E     2
F     2
G     2
H     2
I     0
J    30
K    30
dtype: int64
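To make the counting step concrete, here is a minimal sketch on a tiny made-up dataframe. The frame `toy` and its columns are hypothetical and are not taken from Data.csv; the point is only to show that summing a boolean mask counts the True entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the tutorial's Data.csv) with a few NaNs.
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [np.nan, np.nan, 6.0],
    "z": [7.0, 8.0, 9.0],
})

# isnull() returns a boolean frame: True where a value is missing.
mask = toy.isnull()

# sum() treats True as 1 and False as 0, so it counts the NaNs per column.
missing_per_column = mask.sum()
print(missing_per_column)
```

Column x has one missing value, y has two and z has none, which is what the per-column sums report.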

Finding percentage of missing values

Looking at the counts above, we can see that 9 of the 11 columns have missing values; only A and I are complete. Now, let’s find out the percentage of missing values in our dataset.

total_rows = data.shape[0]
total_columns = data.shape[1]
total_cells = total_rows * total_columns
# Total number of cells in the original dataframe
total_cells

561

# Summing the per-column counts gives the total number of missing cells
total_missing_values = missing_values.sum()
total_missing_values

130

percent_of_missing_values = round((total_missing_values / total_cells) * 100, 2)
percent_of_missing_values

23.17

Thus, 23.17% of the cells in the dataframe are missing (130 out of 561).
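The same calculation can be sketched on a small made-up frame, along with a per-column variant. The frame `toy` is hypothetical; the per-column version relies on the fact that the mean of a boolean column is the fraction of True values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 4 rows and 2 columns, 3 cells missing.
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 4.0],
    "y": [np.nan, np.nan, 6.0, 7.0],
})

# Same steps as in the tutorial: missing cells divided by total cells.
total_cells = toy.shape[0] * toy.shape[1]     # 4 * 2 = 8
total_missing = toy.isnull().sum().sum()      # 3
overall_percent = round(total_missing / total_cells * 100, 2)

# Per-column percentage: mean of the boolean mask, times 100.
percent_per_column = (toy.isnull().mean() * 100).round(2)

print(overall_percent)
print(percent_per_column)
```

The per-column view is often more useful than the overall figure, because it shows which columns carry most of the missing data.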

Let’s now discuss how to use the fillna() function to handle these missing cells.

1. Replacing with 0

One option, though rarely the best one, is to replace all missing values with 0.

data.fillna(0).head(10)

In the result, every missing value has been replaced with 0.
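One reason zero filling is rarely a good default is that the zeros are treated as real observations by later calculations. A minimal sketch with a made-up series (the values are hypothetical) shows the effect on the mean:

```python
import numpy as np
import pandas as pd

# Hypothetical series with two missing values.
s = pd.Series([10.0, np.nan, 20.0, np.nan])

# pandas skips NaNs by default: (10 + 20) / 2
mean_ignoring_nan = s.mean()

# After filling with 0, the zeros count as data: (10 + 0 + 20 + 0) / 4
mean_after_zero_fill = s.fillna(0).mean()

print(mean_ignoring_nan, mean_after_zero_fill)
```

Here the mean drops from 15.0 to 7.5 purely because of the fill value, so zero filling makes sense mainly when 0 is a meaningful value for the column.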

2. Forward fill method

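In the forward fill method, each NaN is replaced with the last non-NaN value above it in the same column. We can apply this by passing ‘ffill’ to the method parameter of the fillna() function. Note that recent pandas releases (2.1 and later) deprecate the method parameter; data.ffill() does the same thing.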

Original dataframe

data.head(10)

data.fillna(method = 'ffill').head(10)

If you compare column B in the original and the forward-filled dataframes, you’ll see that the NaN in the 3rd row (index 2) of the original has been replaced with 3.939059, the value just above it. The one limitation of forward fill is that a NaN at the top of a column stays as it is, because there is no earlier value to copy; the first row of column B is an example.
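The behaviour described above, including the leading NaN that survives, can be reproduced on a small made-up column. This sketch uses DataFrame.ffill(), the method form of forward fill; the frame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical column: NaN at the top, then two consecutive NaNs in the middle.
toy = pd.DataFrame({"b": [np.nan, 3.9, np.nan, np.nan, -0.05]})

# Forward fill copies the last valid value downwards.
forward_filled = toy.ffill()
print(forward_filled)

# The first row is still NaN because nothing precedes it.
leading_nan_remains = forward_filled["b"].isna().iloc[0]
```

Both NaNs in the middle take the value 3.9, while the NaN in the first row is left untouched.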

3. Backward fill method

In this method, each NaN is replaced with the next non-NaN value below it in the same column.
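A small sketch of the rule on a made-up column, using DataFrame.bfill() (the method form of backward fill). The frame `toy` is hypothetical and deliberately ends with a NaN to show what happens when there is no later value to copy.

```python
import numpy as np
import pandas as pd

# Hypothetical column: NaN at the top, in the middle and at the bottom.
toy = pd.DataFrame({"b": [np.nan, 3.9, np.nan, -0.05, np.nan]})

# Backward fill copies the next valid value upwards.
backward_filled = toy.bfill()
print(backward_filled)

# The last row is still NaN because nothing follows it.
trailing_nan_remains = backward_filled["b"].isna().iloc[-1]
```

The first row takes 3.9 and the third row takes -0.05, each from the row below, while the final NaN stays as it is.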

data.head(10)

data.fillna(method = 'bfill').head(10)

Here the limitation is the mirror image: a NaN at the bottom of a column is not replaced, because there is no later value to copy. In the output, the 10th row is completely filled only because the dataframe has more than 10 rows and the 11th row contains non-NaN values.
By default, axis = 0, so values are propagated down each column, from one row to the next. If you want values to be propagated across the columns within each row instead, you have to specify axis = 1.

data.head(10)

data.fillna(method='ffill', axis = 1).head(10)

If you look closely at the first row, you can see that its last NaN values (columns J and K) are replaced with -0.011685, the value just before them in column I. The same happens to the NaN cells of the first row in columns B and C: they are replaced with the value in column A, i.e., -0.487497.
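The difference between the two axis settings is easy to mix up, so here is a minimal side-by-side sketch on a made-up frame. The frame `toy` is hypothetical, and DataFrame.ffill() is used as the method form of forward fill.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column "a" is complete, "b" and "c" have gaps.
toy = pd.DataFrame({
    "a": [1.0, 2.0],
    "b": [np.nan, 5.0],
    "c": [np.nan, np.nan],
})

# axis=0 (default): values move down each column, row to row.
down_columns = toy.ffill(axis=0)

# axis=1: values move left to right across the columns of each row.
across_columns = toy.ffill(axis=1)

print(down_columns)
print(across_columns)
```

With axis=0 nothing changes here, because every NaN sits at the top of its column or has only NaNs above it. With axis=1 the first row becomes 1.0, 1.0, 1.0 and the second row becomes 2.0, 5.0, 5.0. Filling across columns only makes sense when neighbouring columns measure comparable quantities.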

4. By using SimpleImputer() from sklearn.impute

For this, we first import the SimpleImputer class from the sklearn.impute module.

data.head(10)

from sklearn.impute import SimpleImputer
# Initializing an object of the SimpleImputer class
data_Imputer = SimpleImputer()
# fit_transform() returns a NumPy array, so we wrap it in pd.DataFrame
# and pass the original column names and index to keep the labels.
imputedData = pd.DataFrame(data_Imputer.fit_transform(data), columns=data.columns, index=data.index)
imputedData.head(10)

SimpleImputer’s default strategy is the mean. That is, each NaN in a column (column B, for example) is filled with the mean of the non-NaN data points of that column.
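As a minimal sketch of the default behaviour, the example below runs SimpleImputer on a tiny made-up frame and inspects the fitted statistics_ attribute, which stores the fill value learned for each column. The frame `toy` and the variable names are hypothetical; scikit-learn is assumed to be installed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one NaN per column.
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [10.0, 20.0, np.nan],
})

# With no arguments, the strategy is "mean".
imputer = SimpleImputer()
imputed = pd.DataFrame(
    imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

# statistics_ holds the per-column fill values computed during fit.
print(imputer.statistics_)
print(imputed)
```

The NaN in x becomes 2.0 and the NaN in y becomes 15.0, the means of the remaining values in each column. Passing strategy="median" or strategy="most_frequent" to SimpleImputer switches the fill value accordingly.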

There are a few points to note here. When imputing missing values in continuous quantitative data, we should look at the histogram of the specific column.

If the distribution is approximately normal (close to a bell curve), imputing with the mean is a reasonable choice.
If the distribution is skewed, the median is usually the better choice, since it is less affected by extreme values.

For categorical data, we should impute missing values with the mode.
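These rules of thumb can be sketched with plain pandas on a made-up frame. The column names and values below are hypothetical and chosen so that each column suits one of the three statistics; this is an illustration of the idea, not a procedure from the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one symmetric column, one skewed column, one categorical.
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, np.nan, 4.0, 5.0],
    "skewed": [1.0, 2.0, 3.0, np.nan, 100.0],
    "category": ["red", "blue", None, "red", "red"],
})

filled = toy.copy()

# Roughly symmetric values: fill with the mean.
filled["symmetric"] = toy["symmetric"].fillna(toy["symmetric"].mean())

# One extreme value skews the column: fill with the median instead.
filled["skewed"] = toy["skewed"].fillna(toy["skewed"].median())

# Categorical data: fill with the mode. mode() returns a Series
# because ties are possible, so we take its first entry.
filled["category"] = toy["category"].fillna(toy["category"].mode().iloc[0])

print(filled)
```

In the skewed column the median of the observed values is 2.5, while the mean would be 26.5, which shows how strongly a single extreme value can pull a mean-based fill.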

We will learn more about this when we study statistics in the coming few days.

So guys, with this we conclude this tutorial on Handling missing values. In the next part of this Handling missing data series, we will dive a little deeper and use statistics to perform missing value imputation.


Dataframe after replacing with 0

	A	B	C	D	E	F	G	H	I	J	K
0	-0.487497	0.000000	0.000000	0.176801	0.823199	0.188485	0.811515	0.823199	-0.011685	0.000000	0.000000
1	-0.832406	3.939059	-4.771465	0.188485	0.811515	0.209944	0.790056	0.811515	-0.021459	0.073096	-0.094555
2	0.315470	0.000000	0.000000	0.209944	0.790056	0.194976	0.805024	0.790056	0.014968	0.000000	0.000000
3	0.283631	-0.053466	0.337097	0.194976	0.805024	0.201061	0.798939	0.194976	0.006085	-0.001467	0.007551
4	0.473122	0.473122	0.000000	0.201061	0.798939	0.212307	0.787693	0.201061	0.011246	0.011246	0.000000
5	-0.414056	2.488650	-2.902706	0.212307	0.787693	0.211018	0.788982	0.212307	-0.001289	0.075263	-0.076551
6	0.072839	0.000000	0.000000	0.211018	0.788982	0.212731	0.787269	0.211018	0.001713	0.000000	0.000000
7	1.037462	1.065311	-0.027849	0.212731	0.787269	0.235407	0.764593	0.212731	0.022676	0.023817	-0.001140
8	0.131151	-0.165368	0.296519	0.235407	0.764593	0.242157	0.757843	0.235407	0.006750	-0.004869	0.011619
9	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000

Original dataframe

	A	B	C	D	E	F	G	H	I	J	K
0	-0.487497	NaN	NaN	0.176801	0.823199	0.188485	0.811515	0.823199	-0.011685	NaN	NaN
1	-0.832406	3.939059	-4.771465	0.188485	0.811515	0.209944	0.790056	0.811515	-0.021459	0.073096	-0.094555
2	0.315470	NaN	NaN	0.209944	0.790056	0.194976	0.805024	0.790056	0.014968	NaN	NaN
3	0.283631	-0.053466	0.337097	0.194976	0.805024	0.201061	0.798939	0.194976	0.006085	-0.001467	0.007551
4	0.473122	0.473122	0.000000	0.201061	0.798939	0.212307	0.787693	0.201061	0.011246	0.011246	0.000000
5	-0.414056	2.488650	-2.902706	0.212307	0.787693	0.211018	0.788982	0.212307	-0.001289	0.075263	-0.076551
6	0.072839	NaN	NaN	0.211018	0.788982	0.212731	0.787269	0.211018	0.001713	NaN	NaN
7	1.037462	1.065311	-0.027849	0.212731	0.787269	0.235407	0.764593	0.212731	0.022676	0.023817	-0.001140
8	0.131151	-0.165368	0.296519	0.235407	0.764593	0.242157	0.757843	0.235407	0.006750	-0.004869	0.011619
9	0.000000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000	NaN	NaN

Dataframe after ffill method

	A	B	C	D	E	F	G	H	I	J	K
0	-0.487497	NaN	NaN	0.176801	0.823199	0.188485	0.811515	0.823199	-0.011685	NaN	NaN
1	-0.832406	3.939059	-4.771465	0.188485	0.811515	0.209944	0.790056	0.811515	-0.021459	0.073096	-0.094555
2	0.315470	3.939059	-4.771465	0.209944	0.790056	0.194976	0.805024	0.790056	0.014968	0.073096	-0.094555
3	0.283631	-0.053466	0.337097	0.194976	0.805024	0.201061	0.798939	0.194976	0.006085	-0.001467	0.007551
4	0.473122	0.473122	0.000000	0.201061	0.798939	0.212307	0.787693	0.201061	0.011246	0.011246	0.000000
5	-0.414056	2.488650	-2.902706	0.212307	0.787693	0.211018	0.788982	0.212307	-0.001289	0.075263	-0.076551
6	0.072839	2.488650	-2.902706	0.211018	0.788982	0.212731	0.787269	0.211018	0.001713	0.075263	-0.076551
7	1.037462	1.065311	-0.027849	0.212731	0.787269	0.235407	0.764593	0.212731	0.022676	0.023817	-0.001140
8	0.131151	-0.165368	0.296519	0.235407	0.764593	0.242157	0.757843	0.235407	0.006750	-0.004869	0.011619
9	0.000000	-0.165368	0.296519	0.235407	0.764593	0.242157	0.757843	0.235407	0.000000	-0.004869	0.011619

Original dataframe

	A	B	C	D	E	F	G	H	I	J	K
0	-0.487497	NaN	NaN	0.176801	0.823199	0.188485	0.811515	0.823199	-0.011685	NaN	NaN
1	-0.832406	3.939059	-4.771465	0.188485	0.811515	0.209944	0.790056	0.811515	-0.021459	0.073096	-0.094555
2	0.315470	NaN	NaN	0.209944	0.790056	0.194976	0.805024	0.790056	0.014968	NaN	NaN
3	0.283631	-0.053466	0.337097	0.194976	0.805024	0.201061	0.798939	0.194976	0.006085	-0.001467	0.007551
4	0.473122	0.473122	0.000000	0.201061	0.798939	0.212307	0.787693	0.201061	0.011246	0.011246	0.000000
5	-0.414056	2.488650	-2.902706	0.212307	0.787693	0.211018	0.788982	0.212307	-0.001289	0.075263	-0.076551
6	0.072839	NaN	NaN	0.211018	0.788982	0.212731	0.787269	0.211018	0.001713	NaN	NaN
7	1.037462	1.065311	-0.027849	0.212731	0.787269	0.235407	0.764593	0.212731	0.022676	0.023817	-0.001140
8	0.131151	-0.165368	0.296519	0.235407	0.764593	0.242157	0.757843	0.235407	0.006750	-0.004869	0.011619
9	0.000000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000	NaN	NaN

Dataframe after bfill method

	A	B	C	D	E	F	G	H	I	J	K
0	-0.487497	3.939059	-4.771465	0.176801	0.823199	0.188485	0.811515	0.823199	-0.011685	0.073096	-0.094555
1	-0.832406	3.939059	-4.771465	0.188485	0.811515	0.209944	0.790056	0.811515	-0.021459	0.073096	-0.094555
2	0.315470	-0.053466	0.337097	0.209944	0.790056	0.194976	0.805024	0.790056	0.014968	-0.001467	0.007551
3	0.283631	-0.053466	0.337097	0.194976	0.805024	0.201061	0.798939	0.194976	0.006085	-0.001467	0.007551
4	0.473122	0.473122	0.000000	0.201061	0.798939	0.212307	0.787693	0.201061	0.011246	0.011246	0.000000
5	-0.414056	2.488650	-2.902706	0.212307	0.787693	0.211018	0.788982	0.212307	-0.001289	0.075263	-0.076551
6	0.072839	1.065311	-0.027849	0.211018	0.788982	0.212731	0.787269	0.211018	0.001713	0.023817	-0.001140
7	1.037462	1.065311	-0.027849	0.212731	0.787269	0.235407	0.764593	0.212731	0.022676	0.023817	-0.001140
8	0.131151	-0.165368	0.296519	0.235407	0.764593	0.242157	0.757843	0.235407	0.006750	-0.004869	0.011619
9	0.000000	1.220041	-1.607615	0.242157	0.757843	0.233609	0.766391	0.242157	0.000000	0.038459	-0.047007
