The Validation Curve

Tells you where to get off


Consider the typical use case of a boosted classification model such as AdaBoost. AdaBoost is known to overfit on noisy data, so how do you determine how much boosting is too much?

Enter the validation curve. The sklearn validation curve helps us plot the training and cross-validation scores against a specified model parameter. In our case, we vary n_estimators, the number of boosting base estimators.

The accompanying plot shows that the training and validation curves diverge at about 10 boosting estimators, indicating that the model would overfit if we went beyond this value.

Below is the source code fragment (adapted from the sklearn example) to generate the plot. To see the code in context, refer to this Kaggle notebook.

The validation curve is not to be confused with the learning curve, which plots the training and validation scores against the training sample size. The learning curve helps determine whether an already 'tuned' model is overfitting.
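To make the contrast concrete, here is a minimal sketch of sklearn's learning_curve on a small synthetic dataset (the dataset and the fixed n_estimators=10 are assumptions for illustration, not from the original notebook). Note that it varies the training set size rather than a hyperparameter:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import learning_curve

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=500, random_state=0)

# learning_curve sweeps the training-set size; the model
# (including n_estimators) is held fixed.
train_sizes, train_scores, test_scores = learning_curve(
    AdaBoostClassifier(n_estimators=10),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    scoring="accuracy",
    n_jobs=-1,
)

print(train_sizes)               # absolute training-set sizes used
print(test_scores.mean(axis=1))  # mean CV score at each size
```

A persistent gap between the training and cross-validation scores as the sample size grows is the learning curve's signature of overfitting.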

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import validation_curve

# We will plot the AdaBoost validation curve by varying the number of estimators.
# X_train and y_train come from the accompanying Kaggle notebook.
param_range = [2, 5, 10, 20, 50, 100, 500]  # Range of n_estimators
train_scores, test_scores = validation_curve(
    AdaBoostClassifier(),
    X_train,
    y_train,
    param_name="n_estimators",
    param_range=param_range,
    scoring="accuracy",
    n_jobs=-1,
)

# Mean and standard deviation of the scores across the CV folds.
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.title("Validation Curve with AdaBoost")
plt.xlabel("n_estimators")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)
lw = 2
plt.semilogx(
    param_range, train_scores_mean, label="Training score", color="darkorange", lw=lw
)
plt.fill_between(
    param_range,
    train_scores_mean - train_scores_std,
    train_scores_mean + train_scores_std,
    alpha=0.2,
    color="darkorange",
    lw=lw,
)
plt.semilogx(
    param_range, test_scores_mean, label="Cross-validation score", color="navy", lw=lw
)
plt.fill_between(
    param_range,
    test_scores_mean - test_scores_std,
    test_scores_mean + test_scores_std,
    alpha=0.2,
    color="navy",
    lw=lw,
)
plt.legend(loc="best")
plt.show()

The call to validation_curve sweeps the n_estimators parameter across a range of values to determine the point at which the training and validation accuracies begin to diverge (and hence the model begins to overfit).
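Reading the stopping point off the plot can also be done numerically. Below is a hedged sketch using one simple heuristic: stop where the mean cross-validation score peaks. The score arrays here are hypothetical stand-ins shaped like the output of validation_curve above, not results from the actual notebook:

```python
import numpy as np

# Hypothetical mean scores, one per value in param_range (for illustration only).
param_range = [2, 5, 10, 20, 50, 100, 500]
train_scores_mean = np.array([0.80, 0.85, 0.90, 0.95, 0.98, 0.99, 1.00])
test_scores_mean = np.array([0.79, 0.84, 0.88, 0.87, 0.85, 0.84, 0.82])

# Heuristic: stop where the cross-validation score peaks; beyond this
# point the gap to the (ever-rising) training score keeps widening.
best_idx = int(np.argmax(test_scores_mean))
print("n_estimators:", param_range[best_idx])  # → 10 with these stand-in scores
```

Other choices (e.g. the smallest n_estimators within one standard deviation of the peak) trade a little accuracy for a simpler model.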