AutoML Benchmark with shorter time constraints and early stopping

Israel Campero Jurado, Pieter Gijsbers, Joaquin Vanschoren

TL;DR

The paper addresses the computational burden of the AutoML Benchmark (AMLB) for tabular data and proposes evaluating with shorter budgets ($5$, $10$, $30$, and $60$ minutes) and with early stopping. It evaluates $11$ AutoML frameworks on $104$ OpenML tasks, using constant-predictor (CP) imputation for missing results and multiple AutoGluon configurations, and analyzes ranking stability with Critical Difference diagrams and ranking correlations. Results show that framework rankings are highly stable across budgets (e.g., $r > 0.96$ between $60$ and $30$ minutes), with AutoGluon variants typically leading. Early stopping introduces more variability in performance, and the time it saves depends on the framework and dataset. The findings support using shorter budgets to make benchmarking more accessible and sustainable, provided one accounts for framework-specific early-stopping behavior and potential trade-offs on large datasets. The work advocates broader adoption of $5$ to $60$ minute benchmarks with transparent reporting.

Abstract

Automated Machine Learning (AutoML) automatically builds machine learning (ML) models on data. The de facto standard for evaluating new AutoML frameworks for tabular data is the AutoML Benchmark (AMLB). AMLB proposed to evaluate AutoML frameworks using 1- and 4-hour time budgets across 104 tasks. We argue that shorter time constraints should be considered for the benchmark because of their practical value, such as when models need to be retrained with high frequency, and to make AMLB more accessible. This work considers two ways in which to reduce the overall computation used in the benchmark: smaller time constraints and the use of early stopping. We conduct evaluations of 11 AutoML frameworks on 104 tasks with different time constraints and find the relative ranking of AutoML frameworks is fairly consistent across time constraints, but that using early-stopping leads to a greater variety in model performance.
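
To make the ranking-stability idea concrete, the following is a minimal sketch of how a rank correlation between two time budgets could be computed. It is not the authors' code: the function names, the toy scores, and the choice of a Spearman-style correlation on mean ranks are all assumptions made for illustration, and the summary above does not state which correlation statistic the paper uses.

```python
from statistics import mean


def average_ranks(scores):
    """Rank scores so the highest score gets rank 1; ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied scores starting at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def mean_ranks(results):
    """results maps framework -> list of scores (one per task, higher is better)."""
    frameworks = list(results)
    n_tasks = len(results[frameworks[0]])
    per_task = [average_ranks([results[f][t] for f in frameworks])
                for t in range(n_tasks)]
    return {f: mean(r[i] for r in per_task) for i, f in enumerate(frameworks)}


def ranking_correlation(results_a, results_b):
    """Spearman-style correlation between the framework orderings of two budgets."""
    ra, rb = mean_ranks(results_a), mean_ranks(results_b)
    frameworks = list(ra)
    # Negate mean ranks so the framework with the lowest mean rank is ranked first.
    order_a = average_ranks([-ra[f] for f in frameworks])
    order_b = average_ranks([-rb[f] for f in frameworks])
    return pearson(order_a, order_b)


# Hypothetical toy scores for three made-up frameworks on three tasks.
budget_60 = {"A": [0.90, 0.80, 0.70], "B": [0.85, 0.75, 0.72], "C": [0.60, 0.70, 0.50]}
budget_30 = {"A": [0.88, 0.79, 0.69], "B": [0.84, 0.70, 0.71], "C": [0.61, 0.65, 0.45]}

print(ranking_correlation(budget_60, budget_30))
```

In this toy example the absolute scores drop slightly at the shorter budget, but the ordering of the frameworks is unchanged, so the correlation is at its maximum. That is the sense in which rankings can be stable even when performance is not.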

Paper Structure

This paper contains 23 sections, 23 figures, 3 tables.

Figures (23)

  • Figure 1: Critical Difference (CD) diagrams of the evaluated frameworks for each time constraint. The diagrams include the Nemenyi post-hoc test.
  • Figure 2: (a) Confusion matrix based on the correlation between rankings of mean performance per task across time constraints. (b) Histogram and density plot of the correlation between each framework's per-task rankings at each time constraint. (c) Histogram and density plot of the correlation between each framework's per-task rankings in the original AMLB results, 1 hour vs. 4 hours.
  • Figure 3: Performance distribution (a) and regret and time saved (b) across the AutoML frameworks.
  • Figure 4: Relative improved performance reached in each time constraint when considering 5 minutes as 100% over binary tasks.
  • Figure 5: Density plot of the correlation between the rankings of each framework's mean performance per task at each time constraint, split by number of instances and number of features.
  • ...and 18 more figures