Don't Waste Your Time: Early Stopping Cross-Validation

Edward Bergman, Lennart Purucker, Frank Hutter

TL;DR

This paper investigates reducing the computational burden of cross-validation in AutoML for tabular data by introducing two simple early stopping strategies for inner cross-validation during model selection. Through an extensive study on MLP and RF across 36 datasets and multiple fold configurations, the authors show that a forgiving early stopping approach consistently speeds up convergence and expands the explored search space, often improving overall performance, while aggressive stopping can be unreliable. They further explore the approach with Bayesian optimization and repeated cross-validation, finding that Forgiving generally maintains gains and can outperform No ES under several conditions. The work contributes a practical, easy-to-implement framework for early stopping cross-validation, analyzes its interaction with common AutoML workflows, and highlights future research directions, including integration into BO and multi-fidelity approaches. Overall, the findings suggest that simple, robust early stopping can significantly enhance model selection efficiency in AutoML without compromising performance in many scenarios, with potential implications for efficiency and sustainability in AI systems.

Abstract

State-of-the-art automated machine learning systems for tabular data often employ cross-validation, ensuring that measured performances generalize to unseen data or that subsequent ensembling does not overfit. However, using k-fold cross-validation instead of holdout validation drastically increases the computational cost of validating a single configuration. While cross-validation ensures better generalization and, by extension, better performance, the additional cost is often prohibitive for effective model selection within a time budget. We aim to make model selection with cross-validation more effective. Therefore, we study early stopping the process of cross-validation during model selection. We investigate the impact of early stopping on random search for two algorithms, MLP and random forest, across 36 classification datasets. We further analyze the impact of the number of folds by considering 3-, 5-, and 10-fold cross-validation. In addition, we investigate the impact of early stopping with Bayesian optimization instead of random search, and with repeated cross-validation. Our exploratory study shows that even a simple-to-understand and easy-to-implement method consistently allows model selection to converge faster: in ~94% of all datasets, on average by ~214%. Moreover, stopping cross-validation early enables model selection to explore the search space more exhaustively by considering 167% more configurations on average within one hour, while also obtaining better overall performance.
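The summary above names two strategies, Aggressive and Forgiving, but does not spell out their stopping rules. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes that Aggressive stops a configuration once its running mean fold score drops below the incumbent's mean score, and that Forgiving stops only once the running mean drops below the incumbent's worst single-fold score. The names `should_stop`, `cross_validate`, and `evaluate_fold` are hypothetical, and higher scores are assumed to be better.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple


def should_stop(scores_so_far: Sequence[float],
                incumbent_scores: Sequence[float],
                strategy: str) -> bool:
    """Decide whether to skip the remaining folds (higher score is better).

    The thresholds are assumptions made for illustration only.
    """
    if strategy == "none" or not scores_so_far or not incumbent_scores:
        return False  # nothing to compare against, or early stopping disabled
    running_mean = mean(scores_so_far)
    if strategy == "aggressive":
        # Assumed rule: compare against the incumbent's mean fold score.
        return running_mean < mean(incumbent_scores)
    if strategy == "forgiving":
        # Assumed rule: compare against the incumbent's worst fold score.
        return running_mean < min(incumbent_scores)
    raise ValueError(f"unknown strategy: {strategy!r}")


def cross_validate(evaluate_fold: Callable[[int], float],
                   n_folds: int,
                   incumbent_scores: Sequence[float],
                   strategy: str) -> Tuple[List[float], bool]:
    """Evaluate folds one by one; return (fold scores, stopped_early)."""
    scores: List[float] = []
    for fold in range(n_folds):
        scores.append(evaluate_fold(fold))
        is_last_fold = fold == n_folds - 1
        if not is_last_fold and should_stop(scores, incumbent_scores, strategy):
            return scores, True
    return scores, False


# Toy example: fold scores of the current incumbent and of a new candidate.
incumbent = [0.80, 0.84, 0.82]
candidate = [0.81, 0.83, 0.85]


def evaluate_candidate(fold: int) -> float:
    # Stand-in for training and scoring a model on one fold.
    return candidate[fold]


aggressive_scores, aggressive_stopped = cross_validate(
    evaluate_candidate, 3, incumbent, "aggressive")
forgiving_scores, forgiving_stopped = cross_validate(
    evaluate_candidate, 3, incumbent, "forgiving")
```

In this toy example the candidate starts with a below-average first fold, so the assumed Aggressive rule discards it after one fold, while the assumed Forgiving rule lets it finish all three folds and reach a higher mean than the incumbent. This mirrors, in miniature, why an overly strict rule can be unreliable.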

Paper Structure (21 sections, 2 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Speedup Overview Per Dataset for MLP with 10 folds: The time point of matching the best performance of No ES for each method per dataset. A marker indicates when a method reached the same or better performance than No ES. The solid black line visualizes the time saved by the fastest early stopping method per dataset.
  • Figure 2: Footprint Plot for OpenML Task ID 168350 on outer fold 7: A configuration footprint, that is, a multi-dimensional scaling (MDS) embedding of the high-dimensional search space for MLP into two dimensions, showing the landscape of sampled configurations that were either fully evaluated (big marker) or stopped early (small marker). Darker areas represent better-performing parts of the landscape, as estimated by a Random Forest surrogate trained on the configurations with a known performance. A red circle represents the area of the landscape centered around the incumbent configuration. An x indicates a border configuration. A border configuration is sampled from the edges of the conditional search space and helps the MDS embedding to separate clusters of related configurations, i.e., those sharing related preprocessing steps. Dashed lines show the boundary of viable configurations in the 2-dimensional MDS space, as estimated by a Random Forest trained to predict whether a configuration lies within a boxed region of the space. The plot margins indicate sampling density along the given axis. The sampling density is non-uniform even when using random search due to the scaling performed by MDS. This particular footprint is an instance where Aggressive failed to outperform No ES.
  • Figure 3: Validation Performance Over Time: The validation score incumbent trace for each method and cross-validation scenario. The normalization of ROC AUC is explained in the experimental setup section.
  • Figure 4: Test and Validation Performance Comparison Example: The validation (val) and test score incumbent trace for each method and the 10-fold inner cross-validation scenario. The normalization of ROC AUC is explained in the appendix on ROC AUC normalization. Note that the y-scales for (val) and (test) are different to improve readability.