Missing Value Analysis


Create a synthetic dataframe (with missing values)

import pandas as pd
from sklearn.datasets import make_classification

numfeat = 7
numrows = 100
#make_classification returns (X, y); we keep the features X and discard the labels y
sample_df = pd.DataFrame(make_classification(n_samples=numrows, n_features=numfeat)[0])
sample_df.shape
(100, 7)
#Let us give the columns alphabetical names
colnames = [chr(0x41+i) for i in range(numfeat)]
sample_df.columns = colnames
print(sample_df)
#Having got this dataframe, let us pepper it with random np.nan values.
import random
import numpy as np
#lists of row and column indices to drop values from
row_list = list(range(numrows))
col_list = list(range(numfeat))
random.seed(42) #to get reproducible results
#Since we have 700 values (100 rows x 7 columns), let us populate about 50 NaNs.
#(A repeated (row, col) pick overwrites the same cell, so the final count can be slightly under 50.)
for i in range(50):
    random_row = random.choice(row_list)
    random_col = random.choice(col_list)
    sample_df.iloc[random_row, random_col] = np.nan
print(sample_df)
           A         B         C         D         E         F         G
0  -0.127194  1.236139 -0.992594  0.642372 -0.301938 -0.338470       NaN
1  -0.756942 -0.379475  0.854811  0.297301 -0.100486  0.227557  0.939396
2  -1.133766 -1.350423 -1.142068  0.088902 -1.240435  0.326618  2.098233
3  -0.833053 -0.685877 -1.295095  0.204945 -1.278842       NaN  1.270936
4        NaN -0.340500  0.351410  0.447896  0.372048  0.260806  1.075568
..       ...       ...       ...       ...       ...       ...       ...
95 -1.440925  0.963312  2.160002  1.334175  0.953738  0.763906  0.298397
96 -1.182199 -1.223739 -0.269553  0.176725 -0.117301 -0.034381  2.024914
97 -0.783330  0.183404 -0.798328  0.570219  1.000458  0.414616  0.462967
98  0.720470 -0.963628       NaN -0.886744 -0.170301 -0.647375  0.276774
99 -3.419854 -0.969928 -0.721836  1.682513  1.072179  1.907760  3.586141

[100 rows x 7 columns]
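The loop above can pick the same (row, col) pair twice, so fewer than 50 cells may end up as NaN. If an exact count matters, one alternative is to sample distinct flat positions without replacement. A minimal sketch on a stand-in dataframe (the array and column names here are illustrative, not the `sample_df` above):

```python
import numpy as np
import pandas as pd

# Stand-in data: a 100 x 7 array of normals, matching the example's shape.
rng = np.random.default_rng(42)
arr = rng.standard_normal((100, 7))

# Sample 50 distinct flat positions out of the 700 cells (no collisions).
flat = rng.choice(arr.size, size=50, replace=False)
arr.ravel()[flat] = np.nan  # ravel() of a contiguous array is a view

df = pd.DataFrame(arr, columns=[chr(0x41 + i) for i in range(7)])
print(df.isnull().sum().sum())  # exactly 50
```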

Count the missing values

Which are the columns with the most missing values?

mvcount_df = pd.DataFrame(sample_df.isnull().sum().sort_values(ascending=False),
                          columns = ['mvcount'])
#We get a dataframe indexed on the feature name
mvcount_df['mvpc'] = mvcount_df['mvcount']/sample_df.shape[0] #fraction of rows missing in each column
print(mvcount_df)
   mvcount  mvpc
E        9  0.09
F        9  0.09
G        8  0.08
A        7  0.07
B        7  0.07
D        5  0.05
C        4  0.04
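Counting along the other axis is often just as useful: how many missing values does each row carry, and how many rows are complete? A sketch on a stand-in dataframe built the same way (the names here are illustrative):

```python
import numpy as np
import pandas as pd

# Stand-in dataframe with 50 NaNs scattered into a 100 x 7 array.
rng = np.random.default_rng(42)
arr = rng.standard_normal((100, 7))
arr.ravel()[rng.choice(arr.size, size=50, replace=False)] = np.nan
df = pd.DataFrame(arr, columns=list("ABCDEFG"))

# Per-row missing counts: the row-wise analogue of mvcount above.
row_counts = df.isnull().sum(axis=1)

# Rows with no missing value are exactly the ones dropna() would keep.
n_complete = int((row_counts == 0).sum())
print(n_complete, df.dropna().shape[0])  # the two numbers agree
```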

Visualization as a bar graph

import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(1,1,figsize=(10,4))
sns.barplot(x=mvcount_df['mvcount'], y=mvcount_df.index, ax=ax)

<AxesSubplot:xlabel='mvcount'>

(bar graph: missing-value counts per column)

Visualization as a heatmap

import seaborn as sns
import matplotlib.pyplot as plt
fig,ax = plt.subplots(1,1,figsize=(12,4))
sns.heatmap(sample_df.isnull(), yticklabels=False, cbar=False, cmap='Blues', ax=ax)
<AxesSubplot:>

(heatmap: locations of missing values across the dataframe)
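Beyond showing where values are missing, the heatmap hints at whether NaNs in different columns tend to co-occur. One way to quantify this is a co-occurrence count matrix built from the boolean missingness indicators. A sketch on stand-in data (column names illustrative):

```python
import numpy as np
import pandas as pd

# Stand-in dataframe with 50 NaNs scattered into a 100 x 7 array.
rng = np.random.default_rng(42)
arr = rng.standard_normal((100, 7))
arr.ravel()[rng.choice(arr.size, size=50, replace=False)] = np.nan
df = pd.DataFrame(arr, columns=list("ABCDEFG"))

# 0/1 indicator of missingness, then a co-occurrence count matrix:
# co.loc[i, j] = number of rows where column i and column j are both NaN.
miss = df.isnull().astype(int)
co = miss.T @ miss

# The diagonal recovers the per-column counts; the trace is the total.
print(int(np.trace(co.to_numpy())))  # 50
```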

To drop or not to drop

Whether to drop rows or columns depends on the particular dataset and the purpose of your analysis. Here we will simply learn to appreciate the implications of dropping them.

Let us drop the feature with the most missing values and see how many rows we lose.

#From our example, E (tied with F at 9 each) has the most missing values
#We will drop column E and then drop all rows that still have missing values
print(f"{sample_df.drop(['E'], axis=1).dropna().shape[0]} rows remain after the drop operation")
65 rows remain after the drop operation

That is, 65 of the 100 rows survive. To get the number of rows dropped:

print("No. of rows dropped: ")
sample_df.shape[0] - sample_df.drop(['E'], axis=1).dropna().shape[0]
No. of rows dropped: 
35
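Before committing to dropping E, it can help to run the same computation for every column and compare the outcomes side by side. A sketch, again on a stand-in dataframe (names illustrative):

```python
import numpy as np
import pandas as pd

# Stand-in dataframe with 50 NaNs scattered into a 100 x 7 array.
rng = np.random.default_rng(42)
arr = rng.standard_normal((100, 7))
arr.ravel()[rng.choice(arr.size, size=50, replace=False)] = np.nan
df = pd.DataFrame(arr, columns=list("ABCDEFG"))

# Rows surviving a plain dropna(), with no column dropped first.
baseline = df.dropna().shape[0]

# For each candidate column: drop it, then dropna(), and count survivors.
remaining = {c: df.drop(columns=[c]).dropna().shape[0] for c in df.columns}
for c, n in sorted(remaining.items(), key=lambda kv: -kv[1]):
    print(f"drop {c}: {n} rows remain (baseline {baseline})")
```

Dropping a column can only increase the number of surviving rows relative to the baseline, so the loop makes the trade-off explicit for each choice.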

Depending on the dataset and the purpose of the analysis, we could drop more columns and decide whether losing those rows is still acceptable.

Or we could impute values, which is a topic in itself.