AutoML Benchmark with shorter time constraints and early stopping
Israel Campero Jurado, Pieter Gijsbers, Joaquin Vanschoren
TL;DR
The paper addresses the computational burden of the AutoML Benchmark (AMLB) for tabular data and proposes evaluating with shorter time budgets (5, 10, 30, and 60 minutes) and with early stopping. It evaluates 11 AutoML frameworks, including multiple AutoGluon configurations, on 104 OpenML tasks, imputes missing results with the constant predictor (CP), and analyzes ranking stability with Critical Difference diagrams and ranking correlations. Results show that framework rankings are highly stable across budgets (e.g., $r > 0.96$ between the 60- and 30-minute budgets), with AutoGluon variants typically leading. Early stopping introduces more variability in performance, and the time it saves depends on the framework and the dataset. The findings support using shorter budgets to make benchmarking more accessible and sustainable, provided one accounts for framework-specific early-stopping behavior and potential trade-offs on large datasets. The work advocates broader adoption of 5- to 60-minute benchmarks with transparent reporting.
Abstract
Automated Machine Learning (AutoML) automatically builds machine learning (ML) models from data. The de facto standard for evaluating new AutoML frameworks for tabular data is the AutoML Benchmark (AMLB). AMLB proposed evaluating AutoML frameworks using 1- and 4-hour time budgets across 104 tasks. We argue that shorter time constraints should also be considered for the benchmark, both because of their practical value, such as when models need to be retrained with high frequency, and because they make AMLB more accessible. This work considers two ways to reduce the overall computation used in the benchmark: shorter time constraints and early stopping. We evaluate 11 AutoML frameworks on 104 tasks under different time constraints and find that the relative ranking of AutoML frameworks is fairly consistent across time constraints, but that using early stopping leads to greater variability in model performance.
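
As a rough illustration of what "consistent ranking across time constraints" can mean in practice, the sketch below ranks frameworks by a mean score under two budgets and computes a rank correlation between the two rankings. It is a minimal sketch under stated assumptions, not the paper's analysis code: the framework names and scores are hypothetical placeholders, and the use of Spearman's coefficient (Pearson correlation on ranks) is an assumption, since the text above does not say which correlation measure is used.

```python
from statistics import mean


def ranks(scores):
    """Rank frameworks by score (1 = best, i.e. highest); ties share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    result = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over every framework tied with the one at position i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for name in ordered[i:j + 1]:
            result[name] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Plain Pearson correlation between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def rank_correlation(scores_a, scores_b):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    ranks_a, ranks_b = ranks(scores_a), ranks(scores_b)
    names = sorted(ranks_a)
    return pearson([ranks_a[n] for n in names], [ranks_b[n] for n in names])


# Hypothetical mean scores per framework (higher is better); not results from the paper.
scores_60min = {"framework_a": 0.91, "framework_b": 0.89, "framework_c": 0.85, "framework_d": 0.80}
scores_30min = {"framework_a": 0.90, "framework_b": 0.88, "framework_c": 0.81, "framework_d": 0.82}

rho = rank_correlation(scores_60min, scores_30min)
print(f"Rank correlation between 60- and 30-minute budgets: {rho:.2f}")
```

In this toy example only the last two frameworks swap places between budgets, so the correlation stays high but falls below 1. A value close to 1 across all budget pairs would indicate that a shorter budget preserves the ordering of frameworks.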
