Calibrate Your Classifier

You might wonder how you were doing without it all along.


Calibration is easy enough to understand in the context of a weighing scale. How would you map it to the world of machine learning?

Understanding calibration

Let’s take an example from the world of weather forecasting. Suppose a forecast gives a 40% chance of rain – a probability of 0.4. Treating the forecast as a classifier that predicts ‘rain’ or ‘no rain’, is it well calibrated? You cannot check calibration from a single day’s forecast. You need to gather all the forecasts of 40% rain and check whether it really rained on about 40% of those days. If it did, the prediction agrees with the empirical observation. If you broaden this exercise to cover forecasts from 10% to 90% rain and plot them against the empirical observations, what you have is a calibration curve for the forecast. The better the agreement between forecast and observation, the better calibrated the forecast is.
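To make the exercise concrete, here is a minimal sketch with made-up forecast data (the rain outcomes are simulated so that the forecaster is well calibrated by construction); binning forecasts and comparing them with observed frequencies is exactly what sklearn’s calibration_curve does:

import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
forecasts = rng.uniform(0.1, 0.9, size=1000)               # predicted probability of rain
rained = (rng.uniform(size=1000) < forecasts).astype(int)  # simulate a well-calibrated forecaster

# Fraction of rainy days vs. mean forecast in each probability bin
observed_freq, mean_forecast = calibration_curve(rained, forecasts, n_bins=9)
for f, o in zip(mean_forecast, observed_freq):
    print(f"forecast ~{f:.2f} -> it rained on {o:.0%} of those days")

For a well-calibrated forecaster the printed pairs track each other closely; for a real forecaster, the gaps between them are what the calibration curve visualizes.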

Why is calibration important?

Calibration for classifiers is important in at least the following cases:

  • When it is important to assign a probability to an observation – for example, the probability of incidence of a condition.
  • When stacking classifiers: the upstream classifiers (and indeed the downstream ones) should be well calibrated.
  • When you want a meaningful precision-recall tradeoff – the decision threshold is only meaningful if the probabilities are (see the sketch after this list).
  • Even when predicting plain labels – where the classifier applies a default probability threshold of 0.5 – an uncalibrated classifier can perform very poorly.
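As a sketch of the threshold point above – assuming held-out labels y_val and calibrated probabilities p_val are available (hypothetical names) – you can pick a decision threshold that meets a target precision instead of relying on the default 0.5:

from sklearn.metrics import precision_recall_curve

# y_val: held-out true labels, p_val: calibrated predicted probabilities (assumed to exist)
precision, recall, thresholds = precision_recall_curve(y_val, p_val)

target_precision = 0.80
ok = precision[:-1] >= target_precision        # thresholds has one fewer entry than precision
# Smallest qualifying threshold keeps recall as high as possible; fall back to 0.5 otherwise
threshold = thresholds[ok][0] if ok.any() else 0.5
y_pred = (p_val >= threshold).astype(int)

This kind of threshold tuning only makes sense when the probabilities mean what they say – which is the whole point of calibration.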

Most classifiers come uncalibrated out of the box, for various reasons: tree-based classifiers derive their scores from leaf-node class frequencies (they optimize impurity measures such as entropy rather than probability quality), naive Bayes classifiers rely on the simplifying assumption of feature independence, and an SVM doesn’t inherently produce a probability score at all.
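To see the effect, here is an illustrative sketch (synthetic data only, not from the notebook) of an out-of-the-box Gaussian naive Bayes model, whose independence assumption tends to push scores towards the extremes:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import calibration_curve

# Synthetic data with correlated (redundant) features, which violates the NB assumption
X, y = make_classification(n_samples=5000, n_informative=5, n_redundant=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

proba = GaussianNB().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
obs, pred = calibration_curve(y_te, proba, n_bins=10)
print(list(zip(pred.round(2), obs.round(2))))   # predicted vs. empirical frequency per bin

On data like this, the predicted-vs-empirical pairs typically diverge most at the high and low ends – the overconfidence the independence assumption produces.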

Classifier calibration in action

We examine classifier calibration in the context of a travel-customer-churn dataset. Only the relevant code fragments are shown here; for the complete code in context, refer to my Kaggle notebook.

[Figure: pre-calibration reliability curve (left) and predicted-probability histogram (right)]

The plot on the left reveals how far the classifier’s predictions are from the ideal calibration line: the predictions are either too pessimistic or too optimistic. The histogram on the right shows an underconfident classifier that is shy of predicting high probability values for the positive class.

The process of calibration maps the classifier’s predictions (treating the classifier as a black box) so as to move the calibration curve closer to the ideal.
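To give a feel for what that mapping is, sigmoid (Platt) calibration essentially fits a one-dimensional logistic regression from the raw scores to the true labels on held-out data. A minimal sketch, with hypothetical arrays raw_scores_val, y_val and raw_scores_test:

# Sketch of sigmoid/Platt scaling as a post-hoc mapping (variable names assumed):
# fit a 1-D logistic regression from raw scores to labels on held-out data,
# then apply it to new scores to get calibrated probabilities.
from sklearn.linear_model import LogisticRegression

platt = LogisticRegression()
platt.fit(raw_scores_val.reshape(-1, 1), y_val)
calibrated = platt.predict_proba(raw_scores_test.reshape(-1, 1))[:, 1]

In practice you rarely hand-roll this; sklearn’s CalibratedClassifierCV handles the cross-validation and the mapping for you, as in the notebook fragment below.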

...
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
...
from scikeras.wrappers import KerasClassifier
# Wrap the UNFITTED model to give us a scikit model
sk_model = KerasClassifier(model=create_model, epochs=20, batch_size=10, verbose=0)

# Import the calibration infrastructure from sklearn
from sklearn.calibration import CalibratedClassifierCV, CalibrationDisplay

# We use sigmoid calibration
sk_model_sigmoid = CalibratedClassifierCV(sk_model, cv=2, method="sigmoid")

# Fit the model, remember this is an sklearn model
sk_model_sigmoid.fit(X_train, y_train)

# Get the prediction probabilities
# NOTE: THESE ARE PROBABILITIES, NOT LABELS
calibrated_y_pred_proba = sk_model_sigmoid.predict_proba(X_test)[:,1]

# Reliability diagram: eop = empirical observed probability (fraction of positives),
# mpp = mean predicted probability per bin
eop, mpp = calibration_curve(y_test, calibrated_y_pred_proba, n_bins=10)

fig, ax = plt.subplots(1, 2, figsize=(10, 6))

# plot perfectly calibrated
ax[0].plot([0, 1], [0, 1], label="ideal", linestyle='--')
# plot model reliability
ax[0].plot(mpp, eop, marker='.', label="classifier")
ax[0].set(title="Classifier reliability curve (calibrated)", 
          xlabel="mean classifier score", 
          ylabel="empirical frequency")
ax[0].legend()
ax[0].grid()

# plot probabilities histogram
ax[1].hist(calibrated_y_pred_proba, bins=10)
ax[1].set(title="Predicted probability distribution (calibrated)", 
          xlabel="Mean predicted probability", 
          ylabel="Count")
ax[1].grid()

plt.tight_layout()
plt.show()
...

In the code fragment above, we use sklearn’s CalibratedClassifierCV to obtain a calibrated classifier via sigmoid (Platt) calibration.

Note how we wrap the Keras model with the scikeras adapter so that it conforms to the sklearn interface.
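Sigmoid calibration assumes the distortion is roughly sigmoid-shaped. With enough data you could instead try isotonic calibration, which fits a non-parametric monotonic mapping; the swap is a small change (a sketch, not part of the notebook):

# Alternative (not in the notebook): isotonic calibration fits a monotonic
# step function; it is more flexible but generally needs more data than sigmoid.
sk_model_isotonic = CalibratedClassifierCV(sk_model, cv=2, method="isotonic")
sk_model_isotonic.fit(X_train, y_train)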

[Figure: post-calibration reliability curve (left) and predicted-probability histogram (right)]

The post-calibration plot shows a better-behaved classifier whose probabilities are closer to the ideal than those of its uncalibrated version. The histogram on the right shows a classifier that is still somewhat underconfident.
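If you prefer a number to a picture, the Brier score (lower is better) gives a quick before/after comparison; a sketch, assuming the pre-calibration probabilities were kept in a variable such as uncalibrated_y_pred_proba (a hypothetical name):

# Quantify the improvement; uncalibrated_y_pred_proba is assumed to hold
# the probabilities produced by the model before calibration.
from sklearn.metrics import brier_score_loss

print("Brier score (uncalibrated):", brier_score_loss(y_test, uncalibrated_y_pred_proba))
print("Brier score (calibrated):  ", brier_score_loss(y_test, calibrated_y_pred_proba))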

Takeaway

You might now agree that classifier calibration is an important step towards more reliable predictions. Fortunately, it is easy with the sklearn library. Indeed, calibration deserves a place before evaluation, as a way to bring out the best in the classifier you have so diligently trained.