LMEMs for post-hoc analysis of HPO Benchmarking

Anton Geburek, Neeratyoy Mallik, Danny Stoll, Xavier Bouthillier, Frank Hutter

TL;DR

The paper addresses a limitation of common HPO benchmarking practice: aggregating results into simple averages can mask hierarchical differences across benchmarks. It adopts Linear Mixed-Effects Models (LMEMs) for post-hoc significance testing, comparing $M_0: \texttt{loss}\sim\texttt{algorithm}$ with $M_1: \texttt{loss}\sim\texttt{algorithm}+(1|\texttt{benchmark})$ using the Generalized Likelihood Ratio Test (GLRT), and optionally incorporating benchmark metafeatures. Its contributions include ready-to-use LMEM recipes, Autorank-equivalent analyses, and metafeature-guided post-hoc insights that reveal nuances such as budget effects and anomalous benchmarks. The work provides a practical, open-source toolkit for deeper empirical analysis of HPO benchmarks, with potential to improve reliability and resource efficiency in experimental design.
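
The GLRT comparison of two nested models can be sketched in a few lines. This is a minimal illustration and not the paper's toolkit: the function name `glrt_pvalue` and the log-likelihood values are hypothetical, and the sketch assumes both models were fitted by maximum likelihood and differ by a single parameter (the benchmark variance), so that a $\chi^2$ reference distribution with one degree of freedom is used. Because a variance is tested at the boundary of its parameter space, this reference distribution is generally considered conservative.

```python
import math
from typing import Tuple


def glrt_pvalue(loglik_null: float, loglik_alt: float) -> Tuple[float, float]:
    """Return the likelihood-ratio statistic and its chi-square(1) p-value.

    loglik_null: maximised log-likelihood of the simpler model (e.g. M_0).
    loglik_alt:  maximised log-likelihood of the richer model (e.g. M_1).
    Assumes the models are nested and differ by exactly one parameter.
    """
    # Likelihood-ratio statistic; clipped at 0 to guard against tiny
    # negative values caused by numerical optimisation error.
    stat = max(0.0, 2.0 * (loglik_alt - loglik_null))
    # For one degree of freedom, P(chi2_1 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical log-likelihoods, for illustration only (not results from the paper).
stat, p = glrt_pvalue(loglik_null=-120.0, loglik_alt=-115.0)
print(f"GLRT statistic = {stat:.2f}, p = {p:.4f}")
```

A small p-value would favour keeping the benchmark random effect in the model; the log-likelihoods themselves would come from whichever LMEM fitting library is used.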

Abstract

The importance of tuning hyperparameters in Machine Learning (ML) and Deep Learning (DL) is well established through empirical research and applications, as is evident from the steady stream of new hyperparameter optimization (HPO) algorithms and benchmarks added by the community. However, current benchmarking practices that average performance across many datasets may obscure key differences between HPO methods, especially for pairwise comparisons. In this work, we apply significance testing based on Linear Mixed-Effect Models (LMEMs) for post-hoc analysis of HPO benchmarking runs. LMEMs allow flexible and expressive modeling of the entire experiment data, including information such as benchmark meta-features, offering deeper insights than current analysis practices. We demonstrate this through a case study on the PriorBand paper's experiment data, finding insights not reported in the original work.

Paper Structure (11 sections, 5 figures)

Figures (5)

  • Figure 1: (left) Fixed effects are fully observed and typically noise-free, i.e., loss (y-axis) recorded against algorithms (x-axis); (right) Random effects assume samples to be drawn from some random distribution within each specific group, as described by $(1|\texttt{benchmark})$ and the $3$ lines representing $3$ benchmark groups. Image sourced under CC BY-SA 4.0.
  • Figure 2: LMEMs can Autorank: (left) output of Autorank; (right) output of LMEMs on the same data with the simple model: $\texttt{loss}\sim\texttt{algorithm}$.
  • Figure 3: Preset sanity checks run on the experiment data indicate that there is no algorithm for which the seed explains the performance variation, and no benchmark on which performance does not differ across algorithms. The checks also indicate that the budget used should be modeled as an interaction effect in LMEMs on this data.
  • Figure 4: Clustering of benchmarks over prior qualities
  • Figure 5: (left) Autorank at three different HPO budget horizons ($5\times$, $10\times$, $15\times$); (right) LMEM trained on all available data from $5-15\times$ budget, including the budget as a random effect: $\texttt{loss}\sim\texttt{algorithm}+(1|\texttt{budget})+(1|\texttt{benchmark})$.
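
The model formula in the Figure 5 caption can be written out explicitly. The following is a sketch in assumed notation (the symbols $y$, $\mu$, $\alpha_a$, $u_b$, $v_t$, and the variance parameters do not appear in the source), reading $(1|\texttt{benchmark})$ and $(1|\texttt{budget})$ as crossed random intercepts:

$$
y_{a,b,t,s} = \mu + \alpha_a + u_b + v_t + \varepsilon_{a,b,t,s}, \qquad
u_b \sim \mathcal{N}(0, \sigma_b^2), \quad
v_t \sim \mathcal{N}(0, \sigma_t^2), \quad
\varepsilon_{a,b,t,s} \sim \mathcal{N}(0, \sigma^2)
$$

Here $y_{a,b,t,s}$ is the loss of algorithm $a$ on benchmark $b$ at budget level $t$ with seed $s$, $\alpha_a$ is the fixed effect of the algorithm, and $u_b$ and $v_t$ are the per-benchmark and per-budget random intercepts. Dropping $v_t$ recovers $M_1: \texttt{loss}\sim\texttt{algorithm}+(1|\texttt{benchmark})$, and dropping both random intercepts recovers $M_0: \texttt{loss}\sim\texttt{algorithm}$.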