Categorical vs Categorical Heatmap
Identify high impact categorical variables in classification datasets
Classification datasets often contain many categorical variables. Selecting the more important ones for modelling is a recurring need, especially in high-dimensional datasets.
The crosstab
The crosstab (also known as a contingency table) tabulates how often each combination of values of two or more categorical features occurs. Let us assume one of the categorical variables is the target variable. The crosstab then lets us observe how the marginal distribution (i.e. the distribution of the target variable ignoring all categorical features/conditions) gets perturbed by the values of a categorical variable.
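To make the idea concrete, here is a minimal sketch of a raw-count crosstab. The toy data and the `meal_plan` column are invented for illustration and are not taken from the hotel booking dataset; only `pd.crosstab` itself is the real pandas call.

```python
import pandas as pd

# Hypothetical toy data: the values are invented, not from the Kaggle dataset.
toy = pd.DataFrame({
    "booking_status": ["Canceled", "Not_Canceled", "Not_Canceled", "Canceled",
                       "Not_Canceled", "Not_Canceled", "Canceled", "Not_Canceled"],
    "meal_plan": ["A", "A", "A", "B", "B", "B", "B", "A"],
})

# Rows are target values, columns are feature values, cells are raw counts.
counts = pd.crosstab(toy["booking_status"], toy["meal_plan"])
print(counts)
```

In this toy table, plan A has 1 cancellation out of 4 bookings while plan B has 2 out of 4, so the target distribution differs between the two feature values.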
The heatmap gives us a quick visual indication of the noteworthy aspects of the feature relationships.
Here is a code fragment taken from my Kaggle notebook showing the influence of 9 categorical variables on the target variable signifying hotel booking status. For the complete code in context refer to the Kaggle notebook.
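# Code fragment ----------------------
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# reservedf and cat_feats are defined elsewhere in the notebook
fig, ax = plt.subplots(3, 3, figsize=(20, 18))
axes = ax.flat
# Add the weekday feature to cat_feats
cat_feats.append('weekday')
for i, feat in enumerate(cat_feats):
    # One heatmap per feature: column-normalized crosstab of target vs feature
    sns.heatmap(pd.crosstab(reservedf['booking_status'], reservedf[feat], margins=True, normalize='columns'),
                linewidths=1, annot=True, cmap='Blues', ax=axes[i])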
The margins=True argument gives us the marginal distribution of the target variable reservedf['booking_status'] as an extra 'All' column, and normalize='columns' converts the frequencies in each column to proportions that sum to 1, for quick interpretation.
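A small sketch of what these two arguments return may help. The toy data below is invented for illustration, and the `meal_plan` column is a hypothetical stand-in for any categorical feature.

```python
import pandas as pd

# Hypothetical toy data: the values are invented, not from the Kaggle dataset.
toy = pd.DataFrame({
    "booking_status": ["Canceled", "Not_Canceled", "Not_Canceled", "Canceled",
                       "Not_Canceled", "Not_Canceled", "Canceled", "Not_Canceled"],
    "meal_plan": ["A", "A", "A", "B", "B", "B", "B", "A"],
})

# Each column is scaled to sum to 1; margins=True appends an 'All' column
# holding the overall (marginal) distribution of the target.
table = pd.crosstab(toy["booking_status"], toy["meal_plan"],
                    margins=True, normalize="columns")
print(table.round(3))
```

With normalize='columns', pandas adds the 'All' column but no 'All' row, which is why the heatmap ends with a single extra strip on the right.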
The crosstab heatmap
The above code produces the following heatmaps. We get 9 heatmaps, one for each categorical feature with respect to the target.
Reading the heatmap
A crosstab heatmap is a graphical representation of a contingency table. Each cell of the heatmap shows the count (or relative frequency) of the intersection of two different categories in a dataset.
The way we have drawn the heatmap, normalized over columns, shows the distribution of each category value as a colour profile. The right-most strip in each heatmap shows the marginal distribution profile of the target variable.
All we have to do is compare the colour profiles of the values of the categorical variable (all vertical strips except the last one) with the target profile (the right-most strip). A notable deviation in the colour profile supported by the actual numbers indicates a categorical variable affecting the target.
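The same comparison can be sketched numerically. The helper below, `max_deviation`, is a hypothetical name introduced here and not part of the notebook. It reports the largest absolute gap between any category's column and the 'All' column. The toy data is invented: `deposit` is built to shift the target and `channel` is built not to.

```python
import pandas as pd

# Hypothetical toy data: 'deposit' is constructed to shift the target, 'channel' is not.
toy = pd.DataFrame({
    "booking_status": ["Canceled"] * 4 + ["Not_Canceled"] * 4,
    "deposit": ["No_Deposit", "No_Deposit", "No_Deposit", "Deposit",
                "Deposit", "Deposit", "Deposit", "No_Deposit"],
    "channel": ["Web", "Phone", "Web", "Phone", "Web", "Phone", "Web", "Phone"],
})

def max_deviation(df, target, feat):
    """Largest absolute gap between any category's column and the 'All' column."""
    table = pd.crosstab(df[target], df[feat], margins=True, normalize="columns")
    # Subtract the marginal profile from every category profile, row by row.
    gaps = table.drop(columns="All").sub(table["All"], axis=0).abs()
    return float(gaps.to_numpy().max())

scores = {feat: max_deviation(toy, "booking_status", feat)
          for feat in ["deposit", "channel"]}
print(scores)
```

A larger score means at least one strip departs further from the right-most strip. This is only a rough aid to reading the heatmap: categories with very few rows can show large gaps by chance, so the underlying counts still need checking.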