New Delhi, India

Handling missing values – Part 1

Hi ML Enthusiasts! Today, we will learn how to handle missing values in Python using pandas and NumPy. If you are new to these libraries, we advise you to first go through our introductory tutorials: Introduction to NumPy, Introduction to Pandas, Basics of NumPy Arrays – Part 1 and Basics of NumPy Arrays – Part 2.

Dynamics behind None and NaN

There are two ways to denote missing values in Python: None and NaN (Not a Number). Let’s examine them, but, before doing that, let’s first import the libraries:

In [0]:
import pandas as pd
import numpy as np

None is a Python object, so a NumPy array that contains it gets the object data type, and every element of that array is then stored as a Python object. For example:

In [2]:
np.array([None])
Out[2]:
array([None], dtype=object)
In [3]:
np.array([0.5, 5, 9.5, None])
Out[3]:
array([0.5, 5, 9.5, None], dtype=object)
In [5]:
np.array([0.5, 5, 9.5]).dtype
Out[5]:
dtype('float64')

As you may have noticed from the above examples, NumPy converts an array to the most general data type needed to hold all of its elements. Once None was included as one of the elements, the array became an object array. It is advisable to avoid None in numeric arrays, because operations on object arrays run at the speed of ordinary Python code, which is especially noticeable in loops and on large arrays. NumPy arrays based on native data types are generally much faster.
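To get a feel for the speed difference, here is a minimal sketch that times the same sum on a float64 array and on an object array. The array size and the number of repetitions are arbitrary choices made for illustration, and the timings will vary from machine to machine, so no expected numbers are shown.

```python
import timeit

import numpy as np

# Same values stored with two different dtypes (size chosen arbitrarily).
values = list(range(100_000))
float_arr = np.array(values, dtype=float)    # native float64 array
object_arr = np.array(values, dtype=object)  # array of Python objects

# Time 20 repetitions of .sum() on each array.
float_time = timeit.timeit(float_arr.sum, number=20)
object_time = timeit.timeit(object_arr.sum, number=20)

print(f"float64 sum: {float_time:.4f} s")
print(f"object  sum: {object_time:.4f} s")
```

On most setups the object version is expected to be noticeably slower, although both return the same total.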

With None included in your data, you won’t be able to perform even basic calculations like sum, min, max, etc.

In [6]:
np.array([0.5, 5, 9.5, None]).sum()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-00832b7d6dec> in <module>()
----> 1 np.array([0.5, 5, 9.5, None]).sum()

/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py in _sum(a, axis, dtype, out, keepdims, initial, where)
     36 def _sum(a, axis=None, dtype=None, out=None, keepdims=False,
     37          initial=_NoValue, where=True):
---> 38     return umr_sum(a, axis, dtype, out, keepdims, initial, where)
     39 
     40 def _prod(a, axis=None, dtype=None, out=None, keepdims=False,

TypeError: unsupported operand type(s) for +: 'float' and 'NoneType'
In [7]:
np.array([0.5, 5, 9.5, None]).min()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-dc6dbd26468d> in <module>()
----> 1 np.array([0.5, 5, 9.5, None]).min()

/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py in _amin(a, axis, out, keepdims, initial, where)
     32 def _amin(a, axis=None, out=None, keepdims=False,
     33           initial=_NoValue, where=True):
---> 34     return umr_minimum(a, axis, None, out, keepdims, initial, where)
     35 
     36 def _sum(a, axis=None, dtype=None, out=None, keepdims=False,

TypeError: '<=' not supported between instances of 'float' and 'NoneType'
In [8]:
np.array([0.5, 5, 9.5, None]).max()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-442472594ea8> in <module>()
----> 1 np.array([0.5, 5, 9.5, None]).max()

/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py in _amax(a, axis, out, keepdims, initial, where)
     28 def _amax(a, axis=None, out=None, keepdims=False,
     29           initial=_NoValue, where=True):
---> 30     return umr_maximum(a, axis, None, out, keepdims, initial, where)
     31 
     32 def _amin(a, axis=None, out=None, keepdims=False,

TypeError: '>=' not supported between instances of 'float' and 'NoneType'

Handling missing values with None included can make life a little messy. But don’t worry! There is another way to represent them, NaN (Not a Number), that comes to our rescue!

In [14]:
np.array([np.nan])
Out[14]:
array([nan])
In [11]:
np.array([np.nan]).dtype
Out[11]:
dtype('float64')
In [13]:
np.array([1, 2, 2.5, np.nan])
Out[13]:
array([1. , 2. , 2.5, nan])
In [15]:
np.array([1, 2, 2.5, np.nan]).dtype
Out[15]:
dtype('float64')

Thus, from the above examples, we can see that NaN has the float64 data type, which is one of NumPy’s native data types. This makes manipulations and calculations much faster with NaN than with None (which forces the object data type).
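If your data already contains None, one way to get back to a native data type is to ask for a float array explicitly. The sketch below assumes the remaining values are numeric; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

raw = [0.5, 5, 9.5, None]

# Requesting a float dtype makes NumPy store None as nan.
as_float = np.array(raw, dtype=float)
print(as_float)
print(as_float.dtype)

# pandas does the same conversion on its own for numeric data.
series = pd.Series(raw)
print(series)
```

In both cases the result is a float64 container in which the missing entry is NaN rather than None.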

NaN propagates: any arithmetic operation or aggregation that includes NaN returns NaN.

In [16]:
6 + 7 + 8 + np.nan
Out[16]:
nan
In [17]:
0 - np.nan
Out[17]:
nan
In [18]:
np.array([1, 2, 2.5, np.nan]).sum()
Out[18]:
nan
In [19]:
np.array([1, 2, 2.5, np.nan]).min()
Out[19]:
nan
In [20]:
np.array([1, 2, 2.5, np.nan]).max()
Out[20]:
nan

You may be thinking that it’s just like coronavirus! It infects everything it comes in contact with! So, what’s the way out of this? Well, NumPy provides NaN-aware versions of these functions that simply ignore the missing values:

In [21]:
np.nansum(np.array([1, 2, 2.5, np.nan]))
Out[21]:
5.5
In [22]:
np.nanmax(np.array([1, 2, 2.5, np.nan]))
Out[22]:
2.5
In [23]:
np.nanmin(np.array([1, 2, 2.5, np.nan]))
Out[23]:
1.0
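The same idea extends to other aggregations. As a small sketch, np.nanmean and np.nanmedian skip NaN as well, and a Boolean mask built with np.isnan can be used to drop NaN by hand before calling an ordinary function.

```python
import numpy as np

arr = np.array([1, 2, 2.5, np.nan])

# NaN-aware aggregations ignore the missing value.
mean_value = np.nanmean(arr)      # mean of 1, 2 and 2.5
median_value = np.nanmedian(arr)  # median of 1, 2 and 2.5
print(mean_value, median_value)

# Alternative: filter NaN out with a mask, then use the regular function.
mask = ~np.isnan(arr)
clean_sum = arr[mask].sum()
print(clean_sum)  # 5.5
```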

Functions for handling missing values

Now, let’s learn how to handle missing (null) values in pandas. The following functions are used for this purpose:

  • isnull(): returns a Boolean mask of the same shape as the data, with True wherever a value is missing and False elsewhere.
  • notnull(): the opposite of isnull(); returns True wherever a value is present.
  • dropna(): returns a copy of the data with the missing values excluded.
  • fillna(): used for missing value imputation, i.e. replacing missing values with a given value such as the mean, median or mode of the column to which the missing values belong.
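As a quick preview, the four functions can be sketched on a small Series. Filling with the mean is just one possible choice, used here only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 2.5, np.nan])

print(s.isnull().tolist())   # [False, False, False, True]
print(s.notnull().tolist())  # [True, True, True, False]
print(s.dropna().tolist())   # [1.0, 2.0, 2.5]

# Series.mean() skips NaN by default, so it can be used as the fill value.
filled = s.fillna(s.mean())
print(filled.tolist())
```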

Handling missing values – Detection

First, let’s learn how to detect missing values.

In [24]:
# Pass a NumPy array containing a missing value to pd.DataFrame and apply isnull() to the result.
pd.DataFrame(np.array([1, 2, 2.5, np.nan])).isnull()
Out[24]:
       0
0  False
1  False
2  False
3   True

As can be seen, out of four values, the first three return False and last one returns True. Now, let’s apply notnull() and see what its outcome is!

In [25]:
# Pass a NumPy array containing a missing value to pd.DataFrame and apply notnull() to the result.
pd.DataFrame(np.array([1, 2, 2.5, np.nan])).notnull()
Out[25]:
       0
0   True
1   True
2   True
3  False

As expected, the opposite of isnull() is returned: out of four values, the first three return True and the last one returns False.
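Two things are worth sketching here. First, NaN never compares equal to anything, not even to itself, so a check like df == np.nan cannot find missing values, which is why isnull() is needed. Second, because True counts as 1, summing the mask gives the number of missing values. The variable names below are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([1, 2, 2.5, np.nan]))

# Comparing with == never matches NaN.
found_with_eq = int((df == np.nan).sum().sum())
print(found_with_eq)  # 0

# isnull() does find it; summing the mask counts the missing values.
print(df.isnull().sum())  # count per column
missing_count = int(df.isnull().sum().sum())
print(missing_count)  # 1
```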

Now, let’s fetch the subset of not null values.

In [31]:
# The code below returns all rows of df whose value in column 0 is not null
df = pd.DataFrame(np.array([1, 2, 2.5, np.nan]))
df[df[0].notnull()]
Out[31]:
     0
0  1.0
1  2.0
2  2.5
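The same masking trick works the other way round. A small sketch, using isnull() as the mask to pull out only the rows that do have a missing value, which can be handy when you want to inspect them before deciding what to do:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([1, 2, 2.5, np.nan]))

# Keep only the rows where column 0 is missing.
missing_rows = df[df[0].isnull()]
print(missing_rows)
print(missing_rows.index.tolist())  # [3]
```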

 

Handling missing values – dropping

Now, let’s learn how to drop null values using the dropna() function.

In [33]:
df.dropna()
Out[33]:
     0
0  1.0
1  2.0
2  2.5

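dropna() removes the entries that contain NA values. With a single column this is straightforward, so let’s try it on a DataFrame with several columns.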

In [38]:
df = pd.DataFrame([[1, 2, np.nan], 
                   [7, 5, 3],
                   [np.nan, 1, 34]])
df
Out[38]:
     0  1     2
0  1.0  2   NaN
1  7.0  5   3.0
2  NaN  1  34.0
In [39]:
df.dropna()
Out[39]:
     0  1    2
1  7.0  5  3.0

By default, dropna() drops the rows that contain NA values. To drop columns instead, pass 'columns' (or 1) to the axis parameter. The default value of axis is 0, which is equivalent to 'index', i.e. rows.

In [40]:
df.dropna(axis='columns')
Out[40]:
   1
0  2
1  5
2  1

Now, let’s examine the how parameter of the dropna() function. But first, let’s add one more column to df that contains only NaN values.

In [41]:
df[3] = np.nan
df
Out[41]:
     0  1     2    3
0  1.0  2   NaN  NaN
1  7.0  5   3.0  NaN
2  NaN  1  34.0  NaN

Now, let’s see how passing different arguments to the how parameter affects the output. By default, how is set to 'any'. Let’s examine the documentation of dropna() first.

In [0]:
df.dropna?

Signature: df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Docstring:
Remove missing values.

See the :ref:`User Guide <missing_data>` for more on which values are
considered missing, and how to work with missing data.

Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0
    Determine if rows or columns which contain missing values are
    removed.

    * 0, or 'index' : Drop rows which contain missing values.
    * 1, or 'columns' : Drop columns which contain missing value.

    .. deprecated:: 0.23.0

       Pass tuple or list to drop on multiple axes.
       Only a single axis is allowed.

how : {'any', 'all'}, default 'any'
    Determine if row or column is removed from DataFrame, when we have
    at least one NA or all NA.

    * 'any' : If any NA values are present, drop that row or column.
    * 'all' : If all values are NA, drop that row or column.

thresh : int, optional
    Require that many non-NA values.
subset : array-like, optional
    Labels along other axis to consider, e.g. if you are dropping rows
    these would be a list of columns to include.
inplace : bool, default False
    If True, do operation inplace and return None.

Returns
-------
DataFrame
    DataFrame with NA entries dropped from it.
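The docstring also mentions the subset and inplace parameters. Here is a minimal sketch of how they might be used on the same df; the variable names are illustrative, and the DataFrame is rebuilt so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, np.nan],
                   [7, 5, 3],
                   [np.nan, 1, 34]])
df[3] = np.nan  # extra column holding only NaN values

# Look for NA only in column 0 when deciding which rows to drop.
by_col0 = df.dropna(subset=[0])
print(by_col0.index.tolist())  # [0, 1]

# Look for NA in columns 0 and 2.
by_col0_and_2 = df.dropna(subset=[0, 2])
print(by_col0_and_2.index.tolist())  # [1]

# With inplace=True the DataFrame itself is modified and None is returned.
df_copy = df.copy()
result = df_copy.dropna(subset=[0], inplace=True)
print(result)  # None
```

Note that the all-NaN column 3 does not cause any row to be dropped here, because it is not part of subset.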

 

In [42]:
df.dropna(axis='columns')
Out[42]:
   1
0  2
1  5
2  1
In [44]:
df.dropna(axis='columns', how='all')
Out[44]:
     0  1     2
0  1.0  2   NaN
1  7.0  5   3.0
2  NaN  1  34.0

Thus, the column in which all values are NaN is dropped, while the columns having at least one non-null value are kept. The thresh parameter specifies the minimum number of non-null values a row (or column) must have in order to be kept, which is 3 in the case below:

In [46]:
df.dropna(thresh = 3)
Out[46]:
     0  1    2    3
1  7.0  5  3.0  NaN
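To see why only row 1 survives, it helps to count the non-null values in each row. The sketch below rebuilds df so that it runs on its own and compares thresh with an equivalent manual filter; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, np.nan],
                   [7, 5, 3],
                   [np.nan, 1, 34]])
df[3] = np.nan

# Number of non-null values in each row.
non_null_per_row = df.notnull().sum(axis=1)
print(non_null_per_row.tolist())  # [2, 3, 2]

# thresh=3 keeps the rows with at least 3 non-null values.
kept = df.dropna(thresh=3)
manual = df[non_null_per_row >= 3]
print(kept.equals(manual))  # True

# thresh works for columns too: keep columns with at least 2 non-null values.
kept_columns = df.dropna(axis='columns', thresh=2)
print(kept_columns.columns.tolist())  # [0, 1, 2]
```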

So guys, with this I conclude this tutorial, Handling missing values – Part 1. In Part 2 of this tutorial, we will see the ways to fill missing values, i.e. how to do missing value imputation. Stay tuned!

 

 
