An Open Source AutoML Benchmark

Pieter Gijsbers, Erin LeDell, Janek Thomas, Sébastien Poirier, Bernd Bischl, Joaquin Vanschoren

TL;DR

AutoML benchmarking has suffered from non-standardized datasets and inconsistent evaluation. This work introduces an open-source, extensible benchmark framework with a public results site to enable ongoing, fair comparisons across AutoML systems. It evaluates four AutoML tools across 39 classification datasets using standardized resource budgets and evaluation metrics, revealing that no tool consistently outperforms a tuned Random Forest and that performance varies by dataset. The paper also discusses meta-learning fairness and future directions, emphasizing reproducibility and practical impact for AutoML research.

Abstract

In recent years, an active field of research has developed around automated machine learning (AutoML). Unfortunately, comparing different AutoML systems is hard and often done incorrectly. We introduce an open, ongoing, and extensible benchmark framework which follows best practices and avoids common mistakes. The framework is open-source, uses public datasets and has a website with up-to-date results. We use the framework to conduct a thorough comparison of 4 AutoML systems across 39 datasets and analyze the results.

Paper Structure

This paper contains 11 sections, 1 figure, and 2 tables.

Figures (1)

  • Figure 1: Scores obtained on each dataset by each framework on each of ten folds. On the left are binary classification problems with their AUROC scores; on the right are multi-class classification problems with their log loss scores. Opaque diamonds represent the average score across all folds.