LMEMs for post-hoc analysis of HPO Benchmarking
Anton Geburek, Neeratyoy Mallik, Danny Stoll, Xavier Bouthillier, Frank Hutter
TL;DR
Aggregating HPO benchmarking results into simple averages can mask hierarchical differences across benchmarks. The paper addresses this by adopting Linear Mixed-Effects Models (LMEMs) for post-hoc significance testing: it compares $M_0: \mathrm{loss} \sim \mathrm{algorithm}$ with $M_1: \mathrm{loss} \sim \mathrm{algorithm} + (1 \mid \mathrm{benchmark})$ using the Generalized Likelihood Ratio Test (GLRT), and optionally incorporates benchmark meta-features. Its contributions include ready-to-use LMEM recipes, Autorank-equivalent analyses, and meta-feature-guided post-hoc insights that reveal nuances such as budget effects and anomalous benchmarks. The work provides a practical, open-source toolkit for deeper empirical analysis of HPO benchmarks, with the potential to improve reliability and resource efficiency in experimental design.
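To make the $M_0$ versus $M_1$ comparison concrete, the following is a minimal sketch of a GLRT for a random benchmark intercept, written with the Python standard library only. It is not the paper's toolkit. The function names `glrt_random_intercept` and `make_rows`, the algorithm labels, and the synthetic losses are all hypothetical. The sketch assumes a balanced design (every algorithm has the same number of runs on every benchmark), which is what permits the closed-form maximum-likelihood solution used here; unbalanced data would need a mixed-model library such as lme4 or statsmodels. The chi-square reference distribution with one degree of freedom is a common convention and tends to be conservative, because the tested variance lies on the boundary of its parameter space.

```python
import math
import random
from collections import Counter, defaultdict


def glrt_random_intercept(rows):
    """GLRT of M0 (loss ~ algorithm) vs M1 (loss ~ algorithm + (1|benchmark)).

    rows: (algorithm, benchmark, loss) tuples from a balanced design.
    Returns (lr_stat, p_value, var_ratio), where var_ratio is the ML estimate
    of benchmark variance divided by residual variance.
    """
    rows = list(rows)
    algs = {a for a, _, _ in rows}
    benches = {b for _, b, _ in rows}
    cells = Counter((a, b) for a, b, _ in rows)
    if len(cells) != len(algs) * len(benches) or len(set(cells.values())) != 1:
        raise ValueError("the closed form below needs a balanced design")

    # In a balanced design the ML fixed effects are the per-algorithm means
    # under both models, so both models share the same residuals.
    by_alg = defaultdict(list)
    for a, _, y in rows:
        by_alg[a].append(y)
    alg_mean = {a: sum(v) / len(v) for a, v in by_alg.items()}

    resid = defaultdict(list)
    for a, b, y in rows:
        resid[b].append(y - alg_mean[a])

    n_obs, n_groups = len(rows), len(benches)
    n = n_obs // n_groups  # runs per benchmark
    total = sum(e * e for v in resid.values() for e in v)
    between = sum(sum(v) ** 2 for v in resid.values()) / n
    within = total - between
    if n < 2 or within <= 0.0:
        raise ValueError("need replicated, non-degenerate data")

    # Profile likelihood in u = 1 / (1 + n * var_ratio), with 0 < u <= 1.
    # u = 1 corresponds to M0 (no benchmark variance); clip the maximiser there.
    u = 1.0 if between == 0.0 else min(1.0, within / ((n - 1) * between))

    # LR = 2 * (loglik(M1) - loglik(M0)), both maximised by ML (not REML).
    lr_stat = max(
        0.0,
        -n_obs * math.log((within + u * between) / (within + between))
        + n_groups * math.log(u),
    )
    # Chi-square(1) tail probability: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(lr_stat / 2.0))
    var_ratio = (1.0 / u - 1.0) / n
    return lr_stat, p_value, var_ratio


def make_rows(bench_sd, seed=0):
    """Synthetic, illustrative data: 2 algorithms x 8 benchmarks x 5 seeds."""
    rng = random.Random(seed)
    rows = []
    for b in range(8):
        shift = rng.gauss(0.0, bench_sd)  # benchmark-specific intercept
        for alg, base in (("algo_A", 0.30), ("algo_B", 0.25)):
            for _ in range(5):
                rows.append((alg, f"bench_{b}", base + shift + rng.gauss(0.0, 0.05)))
    return rows


lr_stat, p_value, var_ratio = glrt_random_intercept(make_rows(bench_sd=1.0))
print(f"LR = {lr_stat:.1f}, p = {p_value:.3g}, variance ratio = {var_ratio:.1f}")
```

A small p-value here indicates that the benchmark grouping explains a significant share of the variance in loss, so $M_1$ is preferred over $M_0$. Calling `make_rows(bench_sd=0.0)` generates data without benchmark effects, for which the statistic should typically stay small.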
Abstract
The importance of tuning hyperparameters in Machine Learning (ML) and Deep Learning (DL) is well established through empirical research and applications, as is evident from the new hyperparameter optimization (HPO) algorithms and benchmarks steadily added by the community. However, the current benchmarking practice of averaging performance across many datasets may obscure key differences between HPO methods, especially for pairwise comparisons. In this work, we apply significance testing based on Linear Mixed-Effects Models (LMEMs) to the post-hoc analysis of HPO benchmarking runs. LMEMs allow flexible and expressive modeling of the entire experiment data, including information such as benchmark meta-features, and so offer deeper insights than current analysis practices. We demonstrate this through a case study on the PriorBand paper's experiment data, finding insights not reported in the original work.
