Learning the Machine Learning, in a Human-friendly Way

Handling Missing Values in Python: Different Methods Explained with Visual Examples

In this post, we will discuss:
  • How to check for missing values
  • Different methods to handle missing values
Real life data sets often contain missing values. There is no single universally acceptable method to handle missing values. It is often left to the judgement of the data scientist to whether drop the missing values or to impute them. However, it is advisable to impute the missing values rather than deleting the entire row of observations which contains non-missing values for other features (columns).

In the example, I demonstrate now there are six columns and seven rows of data.


I imported this data set into python and all the missing values are denoted by NaN (Not-A-Number)


A) Checking for missing values


The following picture shows how to count total number of missing values in entire data set and how to get the count of missing values -column wise.

B) Handling missing values


1) Dropping the missing values


Before deleting the missing values, we should be know the following concept. There are three types of missing values:
  • Missing Completely at Random (MCAR)- ignorable
  • Missing at Random (MAR) - ignorable
  • Missing Not at Random (MNAR) - Not ignorable
To delete/ignore the missing values, it should not be of last type-MNAR. To understand more about these, I request you to read these interesting answers on stackexchange, especially the second answer by Mr. Wayne.

a) Dropping the row where there are missing values

This option should be used when other methods of handling the missing values are not useful. In our example, there was only a one row where there were no single missing values. So only that row was retained when we used dropna() function.


b) Dropping the entire row/column only when there are multiple missing values in the row

As we have seen, the last method of dropping the entire row even when there is only a single missing value is little harsh, we can specify a threshold number of non-missing values before deleting the row. Suppose we want to drop the drop only if there are less than say 2 non-missing values, then we case the following code:




c) Dropping the entire column
>>> cdata= dataset.drop(['Some column'], axis=1)

2) Imputing the missing values

a) Replacing with a given value


i) Replacing with a given number, let us say with 0.
>>> dataset['Some column']= dataset['Some column'].fillna(0)
ii) Replacing with a string, let us say with 'Mumbai'.
>>> dataset['Place']= dataset['Place'].fillna('Mumbai')
Replacing the missing values with a string could be useful where we want to treat missing values as a separate level.

b) Replacing with mean: It is the common method of imputing missing values. However in presence of outliers, this method may lead to erroneous imputations. In such cases, median is an appropriate measure of central tendency. For some reasons, if you have to use mean values for imputation, then treat the outliers before imputations.
>>> dataset['Height']=dataset['Height'].fillna((dataset['Height'].mean()))

c) Replacing with Median: As median is a position based measure of central tendency (middle most item), this method is not affected by presence of outliers.

>>> dataset['Weight']=dataset['Weight'].fillna((dataset['Weight'].median()))
d) Replacing with Mode:

Mode is the measure of central tendency for nominal scale data.

Replacing with mode is little bit trickier. Because unlike mean and median, mode returns a dataframe. Why? Because if there are two modal values, pandas will show both these values as modes.

For example, let us say our data set is ['A', 'A', 'B', 'C', 'C'].
Here both 'A' and 'C' are the modes as they are repeated equal number of times. Hence mode returns a dataframe containing 'A' and 'C' not a single value.

While replacing with mode, we need to use mode()[0] at the end as shown in the code below.
>>> dataset['Country']=dataset['Country'].fillna((dataset['Country'].mode()[0]))
e) Replacing with previous value - Forward fill

In time series data, replacing with nearby values will be more appropriate than replacing it with mean. Forward fill method fills the missing value with the previous value. For better understanding, I have shown the data column both before and after 'ffill'.
>>> dataset['Number of days'] = dataset['Number of days'].fillna(method='ffill')

f) Replacing with next value - Backward fill

Backward fill uses the next value to fill the missing value. You can see how it works in the following example.
>>> dataset['Number of days'] = dataset['Number of days'].fillna(method='bfill')

g) Replacing with average of previous and next value

In time series data, often the average of value of previous and next value will be a better estimate of the missing value. Use the following code to achieve this. I have shown in the following picture how this method works.
>>> dataset['Number of days'] = pd.concat([dataset['Number of days'].ffill(), dataset['Number of days'].bfill()]).groupby(level=0).mean()

h) Interpolation

Similar results can be achieved using interpolation. Interpolation is very flexible with different methods of interpolation such as the default 'linear' (average of ffill and bfill was similar to linear), 'quadratic', 'polynomial' methods (more about this).

>>> dataset['Number of days']=dataset['Number of days'].interpolate()
i) Model based imputation

We can impute the missing values using model based imputation methods. Popular being imputation using K-nearest neighbors (KNN) (Schmitt et al paper on Comparison of Six Methods for Missing Data Imputation).

KNN is useful in predicting missing values in both continuous and categorical data (we use Hamming distance here)

Even under Nearest neighbor based method, there are 3 approaches and they are given below (Tim's answer on stackechange):
If you are interested to how to run this KNN based imputation, you can click here for examples in Python and here for R.

Further reading:

https://datascienceplus.com/missing-value-treatment/  

Summary


We have seen different methods of handling missing values.

Do you have any questions or suggestions? Feel free to share, I will be happy to interact.