Don't stop me now: Rethinking Validation Criteria for Model Parameter Selection

Andrea Apicella; Francesco Isgrò; Andrea Pollastro; Roberto Prevete

TL;DR

This work conducts a systematic empirical and statistical study of how the validation criterion used for model selection affects test performance in neural classifiers, with attention to early stopping.

Abstract

Despite the extensive literature on training loss functions, the evaluation of generalization on the validation set remains underexplored. In this work, we conduct a systematic empirical and statistical study of how the validation criterion used for model selection affects test performance in neural classifiers, with attention to early stopping. Using fully connected networks on standard benchmarks under $k$-fold evaluation, we compare: (i) early stopping with patience and (ii) post-hoc selection over all epochs (i.e. no early stopping). Models are trained with cross-entropy, C-Loss, or PolyLoss; the model parameter selection on the validation set is made using accuracy or one of the three loss functions, each considered independently. Three main findings emerge. (1) Early stopping based on validation accuracy performs worst, consistently selecting checkpoints with lower test accuracy than both loss-based early stopping and post-hoc selection. (2) Loss-based validation criteria yield comparable and more stable test accuracy. (3) Across datasets and folds, any single validation rule often underperforms the test-optimal checkpoint. Overall, the selected model typically achieves test-set performance statistically lower than the best performance across all epochs, regardless of the validation criterion. Our results suggest avoiding validation accuracy (in particular with early stopping) for parameter selection, favoring loss-based validation criteria.
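The two selection procedures compared in the abstract can be illustrated with a minimal sketch. It assumes a per-epoch validation-loss curve has already been recorded and that "patience" means halting once the best-so-far value has not improved for `patience` consecutive epochs. The function names, the 0-based epoch indexing, and the toy numbers are illustrative assumptions, not the authors' implementation.

```python
from typing import Sequence, Tuple


def early_stopping_epoch(val_loss: Sequence[float], patience: int) -> Tuple[int, int]:
    """Return (selected_epoch, halt_epoch) for patience-based early stopping.

    The selected epoch is the best-so-far minimum at the moment training
    halts; training halts once `patience` epochs pass without improvement.
    Epochs are 0-indexed (an assumption of this sketch).
    """
    best_epoch = 0
    best_value = val_loss[0]
    for e in range(1, len(val_loss)):
        if val_loss[e] < best_value:
            # Strict improvement: remember this checkpoint.
            best_value, best_epoch = val_loss[e], e
        elif e - best_epoch >= patience:
            # No improvement for `patience` epochs: stop here.
            return best_epoch, e
    # The patience budget was never exhausted; training ran to the end.
    return best_epoch, len(val_loss) - 1


def post_hoc_epoch(val_loss: Sequence[float]) -> int:
    """Return the epoch with the global validation-loss minimum over all epochs."""
    return min(range(len(val_loss)), key=lambda e: val_loss[e])


if __name__ == "__main__":
    # Toy curve (made-up numbers): a local minimum at epoch 2, global at epoch 6.
    curve = [1.0, 0.8, 0.7, 0.75, 0.9, 0.72, 0.6, 0.65]
    print(early_stopping_epoch(curve, patience=2))
    print(post_hoc_epoch(curve))
```

On this toy curve, early stopping with a patience of 2 halts before the later, lower minimum is reached, while post-hoc selection sees the whole trajectory. The price of post-hoc selection is training for every epoch.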

Paper Structure (14 sections, 12 equations, 8 figures, 2 tables)

Introduction
Related Work
Method
Notation
Post-hoc Checkpoint Selection versus Early Stopping
Statistical Comparison of Model Selection Criteria
Experimental assessment
Datasets
Models
Adopted losses
Training and Evaluation Protocol
Validation Criteria and Loss-Metric Combinations
Results and discussion
Conclusions

Figures (8)

Figure 1: An example of loss $\ell(D,e)$ and accuracy $\mathcal{a}(D,e)$ across $E$ epochs on a dataset $D$. Vertical dashed lines mark the epochs achieving the validation-loss minimum $e^\star_{\ell,D}$ and the validation-accuracy maximum $e^\star_{\mathcal{a},D}$; horizontal dotted lines indicate the corresponding values $L_D^\star=\min_e \ell(D,e)$ (blue) and $A_D^\star=\max_e \mathcal{a}(D,e)$ (orange).
Figure 2: An example comparing early stopping with patience $T$ and post-hoc checkpoint selection on the validation loss $\ell(\mathrm{Val},e)$. The orange dashed line marks the value returned by early stopping (best-so-far at $\hat{e}_{\ell,\mathrm{Val}}$, with training halted at $\hat{e}_{\ell,\mathrm{Val}}+T$), whereas the blue dashed line marks the best validation value, attained at the epoch $e^\star_{\ell,\mathrm{Val}}$ identified retrospectively. The early-stopped checkpoint need not be the best-performing model: it corresponds to a local minimum reached before halting, whereas post-hoc checkpoint selection identifies the global minimum over all epochs.
Figure 3: An example showing, in a single panel, the validation trajectories (loss $\ell(\mathrm{Val},e)$ in blue and accuracy $\mathcal{a}(\mathrm{Val},e)$ in orange, left axis) together with the test accuracy trajectory $\mathcal{a}(\mathrm{Test},e)$ (green, right axis). Vertical dashed lines indicate the validation-selected epochs $e^\star_{\mathcal{a},\mathrm{Val}}$ and $e^\star_{\ell,\mathrm{Val}}$, as well as the test-optimal epoch $e^\star_{\mathcal{a},\mathrm{Test}}$. Horizontal dotted lines mark $L^\star_{\mathrm{Val}}$ and $A^\star_{\mathrm{Val}}$. The test accuracies achieved by the two validation-driven selections, $\mathcal{a}(\mathrm{Test},e^\star_{\ell,\mathrm{Val}})$ and $\mathcal{a}(\mathrm{Test},e^\star_{\mathcal{a},\mathrm{Val}})$, are contrasted with the best achievable value $A^\star_{\mathrm{Test}}$.
Figure 4: Graphical representation of the hypothesis testing results obtained using cross-entropy as the training objective and early stopping with patience $T=10$. Each heatmap reports the p-values obtained from hypothesis tests comparing the test accuracy of models selected using the validation set against the test-optimal accuracy $A^\star_{\mathrm{Test}}$ across cross-validation folds. From left to right, panels correspond to validation based on cross-entropy loss, C-Loss, Poly-1, and validation accuracy, respectively. Rows represent datasets and columns correspond to different parameter-to-sample ratios $r$. Datasets are ordered from top to bottom according to increasing linear separability, estimated using the generalized discrimination value (GDV).
Figure 5: Graphical representation of the hypothesis testing results obtained using cross-entropy as the training objective and early stopping with patience $T=50$. Each heatmap reports the p-values obtained from hypothesis tests comparing the test accuracy of models selected using the validation set against the test-optimal accuracy $A^\star_{\mathrm{Test}}$ across cross-validation folds. From left to right, panels correspond to validation based on cross-entropy loss, C-Loss, Poly-1, and validation accuracy, respectively. Rows represent datasets and columns correspond to different parameter-to-sample ratios $r$. Datasets are ordered from top to bottom according to increasing linear separability, estimated using the generalized discrimination value (GDV).
...and 3 more figures
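
The quantities contrasted in the Figure 3 caption can be computed from three per-epoch curves. The sketch below is a hedged illustration, assuming post-hoc selection over all epochs, 0-based epoch indices, and first-occurrence tie-breaking. The function name, dictionary keys, and numbers are hypothetical and do not come from the paper.

```python
from typing import Dict, Sequence


def selection_gaps(
    val_loss: Sequence[float],
    val_acc: Sequence[float],
    test_acc: Sequence[float],
) -> Dict[str, float]:
    """Compare validation-selected checkpoints with the test-optimal one.

    Returns the selected epochs and the test-accuracy shortfall of each
    validation rule relative to the best test accuracy over all epochs.
    """
    epochs = range(len(test_acc))
    e_loss = min(epochs, key=lambda e: val_loss[e])   # argmin of validation loss
    e_acc = max(epochs, key=lambda e: val_acc[e])     # argmax of validation accuracy
    e_test = max(epochs, key=lambda e: test_acc[e])   # oracle: argmax of test accuracy
    best_test = test_acc[e_test]
    return {
        "epoch_by_val_loss": e_loss,
        "epoch_by_val_acc": e_acc,
        "epoch_test_optimal": e_test,
        "gap_val_loss": best_test - test_acc[e_loss],
        "gap_val_acc": best_test - test_acc[e_acc],
    }


if __name__ == "__main__":
    # Made-up curves for a single fold, for illustration only.
    result = selection_gaps(
        val_loss=[0.9, 0.6, 0.5, 0.55, 0.7],
        val_acc=[0.70, 0.80, 0.79, 0.82, 0.81],
        test_acc=[0.68, 0.78, 0.80, 0.79, 0.81],
    )
    for name, value in result.items():
        print(name, value)
```

The test-optimal epoch is an oracle reference and cannot be used for selection in practice; it only serves as the upper bound against which each validation rule is measured. Collecting these gaps across folds would give the per-fold samples that a statistical comparison needs.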
