HPOBench: A Collection of Reproducible Multi-Fidelity Benchmark Problems for HPO

Katharina Eggensperger, Philipp Müller, Neeratyoy Mallik, Matthias Feurer, René Sass, Aaron Klein, Noor Awad, Marius Lindauer, Frank Hutter

TL;DR

HPOBench addresses the need for realistic, diverse, and reproducible benchmarks for hyperparameter optimization, with a particular emphasis on multi-fidelity problems. It delivers a containerized library of 12 benchmark families (7 existing, 5 new) totaling over 100 multi-fidelity problems, plus surrogate and tabular variants to enable scalable evaluations. The paper demonstrates broad compatibility by evaluating 13 optimizers from 6 optimization tools and shows that advanced multi-fidelity methods offer substantial benefits at small budgets while remaining competitive at larger budgets. This benchmark suite aims to standardize and accelerate progress in HPO, NAS, and transfer-MLO across diverse datasets and fidelities, enabling fair comparisons and long-term maintainability.

Abstract

To achieve peak predictive performance, hyperparameter optimization (HPO) is a crucial component of machine learning and its applications. Over the last few years, the number of efficient algorithms and tools for HPO has grown substantially. At the same time, the community still lacks realistic, diverse, computationally cheap, and standardized benchmarks. This is especially the case for multi-fidelity HPO methods. To close this gap, we propose HPOBench, which includes 7 existing and 5 new benchmark families, with a total of more than 100 multi-fidelity benchmark problems. HPOBench makes it possible to run this extendable set of multi-fidelity HPO benchmarks in a reproducible way by isolating and packaging the individual benchmarks in containers. It also provides surrogate and tabular benchmarks for computationally affordable yet statistically sound evaluations. To demonstrate HPOBench's broad compatibility with various optimization tools, as well as its usefulness, we conduct an exemplary large-scale study evaluating 13 optimizers from 6 optimization tools. We provide HPOBench here: https://github.com/automl/HPOBench.
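
To make the idea of a tabular multi-fidelity benchmark concrete, the following is a minimal, self-contained sketch. It is not HPOBench's API: the class `ToyTabularBenchmark`, its hyperparameters (`lr`, `depth`), the `epochs` fidelity, and the synthetic loss formula are all hypothetical and chosen only for illustration. The point it shows is that a tabular benchmark answers queries by table lookup, so evaluating a configuration at a low or high fidelity costs almost nothing at benchmark time.

```python
import itertools
import random


class ToyTabularBenchmark:
    """Hypothetical tabular benchmark: every (config, fidelity) pair is precomputed."""

    LEARNING_RATES = (0.001, 0.01, 0.1)
    DEPTHS = (2, 4, 8)
    EPOCHS = (1, 3, 9, 27)  # fidelity values; 27 is the full budget

    def __init__(self):
        # Synthetic stand-in for a table of recorded validation losses.
        self._table = {}
        for lr, depth, epochs in itertools.product(
            self.LEARNING_RATES, self.DEPTHS, self.EPOCHS
        ):
            base = abs(lr - 0.01) * 5 + abs(depth - 4) * 0.05
            self._table[(lr, depth, epochs)] = {
                "function_value": base + 1.0 / epochs,  # more epochs, lower loss
                "cost": float(epochs),                  # simulated training cost
            }

    def sample_configuration(self, rng):
        """Draw one configuration uniformly from the discrete grid."""
        return {
            "lr": rng.choice(self.LEARNING_RATES),
            "depth": rng.choice(self.DEPTHS),
        }

    def objective_function(self, configuration, fidelity=None):
        """Look up the stored result; defaults to the highest fidelity."""
        epochs = (fidelity or {}).get("epochs", max(self.EPOCHS))
        return self._table[(configuration["lr"], configuration["depth"], epochs)]


def demo(seed=0):
    bench = ToyTabularBenchmark()
    config = bench.sample_configuration(random.Random(seed))
    cheap = bench.objective_function(config, {"epochs": 1})
    full = bench.objective_function(config)
    return config, cheap, full
```

A multi-fidelity optimizer would call `objective_function` with a small `epochs` value to screen many configurations cheaply and reserve the full budget for the promising ones. A surrogate benchmark would replace the lookup table with a fitted regression model, which is not shown here.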

Paper Structure

This paper contains 32 sections, 10 figures, 23 tables.

Figures (10)

  • Figure 1: Overview of benchmark environments with (upper) and without (lower) using containers.
  • Figure 2: Code example initializing and evaluating a benchmark.
  • Figure 3: Empirical cumulative distribution. Each plot corresponds to one ML algorithm, and each line within a plot corresponds to one dataset. The lines show the ECDF of the normalized regret of all evaluated configurations of the respective ML algorithm on the respective dataset.
  • Figure 4: Mean rank over time across 32 repetitions of different sets of optimizers (lower is better). The left part shows the rank across all existing community (upper row) and new (lower row) benchmarks. The right part reports results on the existing community benchmarks only for subsets of optimizers.
  • Figure 5: Median rank over time. We report the median rank of the performance across all benchmarks of a benchmark family (see Table \ref{tab:benchmarks}) for all optimizers.
  • ...and 5 more figures
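
The rank-over-time plots listed above (Figures 4 and 5) aggregate optimizer trajectories by rank. The following sketch shows one plausible way to compute a mean rank over time; the exact procedure used in the paper is not given in this summary, so the averaging scheme, the tie handling, and the function names `ranks` and `mean_rank_over_time` are assumptions made for illustration.

```python
from statistics import mean


def ranks(values):
    """Rank values ascending (1 = lowest loss); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg
        pos = end + 1
    return result


def mean_rank_over_time(trajectories):
    """trajectories[optimizer][repetition][t] = best loss found up to step t.

    Assumes every optimizer has the same number of repetitions and steps.
    """
    names = list(trajectories)
    n_reps = len(trajectories[names[0]])
    n_steps = len(trajectories[names[0]][0])
    out = {name: [] for name in names}
    for t in range(n_steps):
        # Rank the optimizers against each other within each repetition.
        per_rep = [
            ranks([trajectories[name][r][t] for name in names])
            for r in range(n_reps)
        ]
        for j, name in enumerate(names):
            out[name].append(mean(rep[j] for rep in per_rep))
    return out


# Made-up trajectories for two hypothetical optimizers, two repetitions each.
example = {
    "random_search": [[0.9, 0.5, 0.4], [0.8, 0.6, 0.3]],
    "multi_fidelity": [[0.6, 0.5, 0.2], [0.7, 0.4, 0.3]],
}
result = mean_rank_over_time(example)
```

Replacing `mean` with `statistics.median` would give a median rank in the spirit of Figure 5; in both cases lower is better.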

Theorems & Definitions (3)

  • Definition 1: HPO Benchmark
  • Definition 2: Tabular Benchmark
  • Definition 3: Surrogate Benchmark