Naive Bayes With Quantile Discretization
Discretization saves the day!
Classification datasets often contain a mix of continuous and categorical features. The continuous features typically suffer from outliers, noise, and the lack of a well-defined distribution, and such quirky data makes it difficult to fit classifiers even for a baseline.
Discretization of continuous features can bail us out of these problems. Discretization bins the continuous values into a defined number of intervals, converting them into ordinal categorical data that plays along very well with classifiers such as Random Forest and Categorical Naive Bayes.
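To make that concrete, here is a minimal sketch of equal-frequency binning with pandas; the column values are made up for illustration, not taken from the notebook.

import pandas as pd

# Toy column with one large outlier.
income = pd.Series([2.1, 2.4, 2.9, 3.3, 3.8, 4.2, 95.0])

# Equal-frequency (quantile) binning into 4 ordinal codes 0..3;
# the outlier simply lands in the top bin instead of stretching the scale.
codes = pd.qcut(income, q=4, labels=False)
print(codes.tolist())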
The sklearn library offers three kinds of discretization through KBinsDiscretizer: uniform (equal-width bins), quantile (equal-frequency bins), and k-means (bins formed around 1D cluster centroids). You have to experiment with your modelling to decide the strategy and the number of bins.
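As a quick illustration of how differently the strategies cut, the sketch below fits all three on a small skewed sample (again, made-up values):

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Skewed toy data: most values are small, two are far out.
x = np.array([[1.0], [1.2], [1.3], [1.5], [2.0], [2.2], [9.0], [50.0]])

for strategy in ('uniform', 'quantile', 'kmeans'):
    discr = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy=strategy)
    discr.fit(x)
    print(strategy, discr.bin_edges_[0])   # where each strategy places its cut points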
The next code fragment shows the quantile-based discretizer with 7 bins.
...
from sklearn.preprocessing import KBinsDiscretizer
#Discretizer works best in this case with quantile strategy, perhaps due
#to the many outliers in the data.
discr = KBinsDiscretizer(n_bins=7, encode='ordinal', strategy='quantile')
#Use fit/fit_transform to transform your data
#or use it as part of a pipeline.
...
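Putting it together, here is a minimal sketch of how the discretizer might sit in front of Categorical Naive Bayes in a pipeline; X, y and the random data are placeholders standing in for the real dataset, not the code from the notebook.

import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.model_selection import cross_val_score

# Stand-in data: skewed, outlier-prone continuous features and a binary label.
rng = np.random.default_rng(0)
X = rng.lognormal(size=(200, 4))
y = (X[:, 0] + rng.normal(size=200) > 1).astype(int)

nb = make_pipeline(
    KBinsDiscretizer(n_bins=7, encode='ordinal', strategy='quantile'),
    CategoricalNB(min_categories=7),   # every feature has at most 7 bin codes
)
print(cross_val_score(nb, X, y, cv=5).mean())

Passing min_categories=7 tells CategoricalNB to reserve room for all seven bin codes, so a code that happens to be missing from a training fold does not break prediction.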
Incidentally, discretization can also guard against overfitting. For the code in context, refer to my Kaggle notebook.