In-Context Freeze-Thaw Bayesian Optimization for Hyperparameter Optimization

Herilalaina Rakotoarison, Steven Adriaensen, Neeratyoy Mallik, Samir Garibov, Edward Bergman, Frank Hutter

TL;DR

We address the cost and reliability challenges of hyperparameter optimization (HPO) for deep learning by replacing online surrogate updates with an in-context, Transformer-based surrogate (FT-PFN) trained on synthetic data sampled from a prior. Coupled with a randomized MFPI acquisition (MFPI-random) in the in-context freeze-thaw BO framework (ifBO), the surrogate performs Bayesian learning curve extrapolation in a single forward pass, with no refitting. Empirically, FT-PFN gives more accurate predictions and 10–100× faster inference than prior surrogates, and ifBO achieves new state-of-the-art performance on three deep learning HPO benchmarks in the low-budget regime. Avoiding online refitting reduces overhead and instability, and the code is open source to support reproducibility and further research.

Abstract

With the increasing computational costs associated with deep learning, automated hyperparameter optimization methods that rely strongly on black-box Bayesian optimization (BO) face limitations. Freeze-thaw BO offers a promising grey-box alternative, strategically allocating scarce resources incrementally to different configurations. However, the frequent surrogate model updates inherent to this approach pose challenges for existing methods, which must retrain or fine-tune their neural network surrogates online, introducing overhead, instability, and hyper-hyperparameters. In this work, we propose FT-PFN, a novel surrogate for freeze-thaw style BO. FT-PFN is a prior-data fitted network (PFN) that leverages transformers' in-context learning ability to perform Bayesian learning curve extrapolation efficiently and reliably in a single forward pass. Our empirical analysis across three benchmark suites shows that the predictions made by FT-PFN are more accurate and 10–100 times faster than those of the deep Gaussian process and deep ensemble surrogates used in previous work. Furthermore, we show that, when combined with our novel acquisition mechanism (MFPI-random), the resulting in-context freeze-thaw BO method (ifBO) yields new state-of-the-art performance on the same three families of deep learning HPO benchmarks considered in prior work.
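To make the freeze-thaw idea of incremental resource allocation concrete, the following is a minimal sketch under stated assumptions, not the ifBO implementation. The names `toy_curve`, `naive_score`, and `freeze_thaw` are hypothetical, the learning curves are synthetic, and `naive_score` is a crude stand-in for the surrogate and acquisition (it is neither FT-PFN nor MFPI-random). Only the control flow is the point: each iteration picks one configuration, trains it for one more step, and freezes it again.

```python
def toy_curve(lam, step):
    """Hypothetical validation score of configuration `lam` after `step` training steps."""
    ceiling = 1.0 - (lam - 0.3) ** 2       # best asymptote at lam = 0.3 (assumption)
    return ceiling * (1.0 - 0.8 ** step)   # saturating improvement over steps


def naive_score(curve):
    """Stand-in for surrogate + acquisition: last value plus the last improvement."""
    if not curve:
        return float("inf")                # unseen configurations are tried once
    prev = curve[-2] if len(curve) > 1 else 0.0
    return curve[-1] + (curve[-1] - prev)


def freeze_thaw(pool, budget):
    """Spend `budget` single training steps across the configurations in `pool`."""
    curves = {lam: [] for lam in pool}     # partial learning curves of frozen runs
    for _ in range(budget):
        # Choose which run to thaw (or start) based on the partial curves so far.
        lam = max(pool, key=lambda l: naive_score(curves[l]))
        step = len(curves[lam]) + 1
        curves[lam].append(toy_curve(lam, step))  # train one more step, then freeze
    return curves


pool = [i / 10 for i in range(10)]
curves = freeze_thaw(pool, budget=30)
most_trained = max(curves, key=lambda l: len(curves[l]))
print(most_trained, len(curves[most_trained]))
```

In this toy setting every configuration receives one step and the remaining budget goes to the most promising one. A real freeze-thaw method would replace `naive_score` with a probabilistic surrogate and an acquisition function that balance exploration and exploitation.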

Paper Structure

This paper contains 47 sections, 9 equations, 17 figures, 2 tables, 2 algorithms.

Figures (17)

  • Figure 1: Comparison of freeze-thaw surrogate model predictions, given the same set of hyperparameters (HPs) and their partial learning curves. The ground-truth curves show the real learning curves, with dots ($\bullet$) marking the points observed as the training set, or context, for all surrogates. $\texttt{ifBO}$ uses $\texttt{FT-PFN}$ as its surrogate, which requires no refitting; instead it uses the training dots as context to infer the posterior predictive distribution of the model performance at step $b$ for any given set of HPs. The surrogates used in prior work, Deep Power Law ensembles ($\texttt{DPL}$) and a deep kernel Gaussian process ($\texttt{DyHPO}$), are trained on the training set until convergence and then used to extrapolate the given partial curves. The bottom row shows, for each surrogate, the probabilistic performance predictions made at step $50$ (the last step in the top row), with stars ($\star$) indicating the true value of the curve.
  • Figure 2: Diagram of the prior data model described in Section \ref{sec:surrogate}, which was used to generate data for meta-training $\texttt{FT-PFN}$. On the left is the randomly initialized neural network $\pi_{\text{config}}$ that models the relationship between a hyperparameter setting $\lambda$ and its learning curve (shown in pink). Its output parameterizes a curve model $\pi_{\text{curve}}$ that is a linear combination of $K$ (=2 in this illustration) basis functions (shown in red and blue) with added $\lambda$-specific Gaussian noise with variance $\sigma^2$.
  • Figure 3: Comparison of our method against state-of-the-art baselines on all 3 benchmarks. The first row shows normalized regret aggregated across multiple tasks in each benchmark (see Appendix \ref{a:benchmarks} for benchmark details; per-task results are in Appendix \ref{a:per_task_plot}). The second row shows the average rank of each method.
  • Figure 4: Results of an ablation study of the acquisition function in $\texttt{ifBO}$ on each benchmark family. The first row shows normalized regret aggregated across multiple tasks in each benchmark (Appendix \ref{a:benchmarks}). The second row shows the average rank of each method.
  • Figure 5: Twenty-one i.i.d. samples of the $\texttt{FT-PFN}$ prior, i.e., synthetically generated collections of learning curves for the same task using different hyperparameter configurations. In these examples, we consider 3 hyperparameters that are mapped onto the color of the curves, so that runs using similar hyperparameters have similarly colored curves. We observe correlations, to varying degrees, between curves on the same task, especially between those with similar hyperparameter configurations.
  • ...and 12 more figures
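
The prior described in the Figure 2 caption can be illustrated with a rough sketch. It assumes a one-hidden-layer network for $\pi_{\text{config}}$, two illustrative saturating basis functions, and an arbitrary noise scale. None of these choices are taken from the paper, and all function names (`make_pi_config`, `sample_curve`) are hypothetical. What the sketch keeps from the caption is the structure: a randomly initialized network maps $\lambda$ to the weights of $K$ basis functions and to a $\lambda$-specific noise level.

```python
import math
import random


def make_pi_config(n_hps, n_hidden, n_out, rng):
    """Randomly initialised one-hidden-layer network mapping lambda to n_out raw outputs."""
    w1 = [[rng.gauss(0, 1) for _ in range(n_hps)] for _ in range(n_hidden)]
    w2 = [[rng.gauss(0, 1) for _ in range(n_hidden)] for _ in range(n_out)]

    def pi_config(lam):
        hidden = [math.tanh(sum(w * x for w, x in zip(row, lam))) for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    return pi_config


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# K = 2 illustrative basis functions on t in (0, 1]; both rise from 0 toward 1.
# Their functional forms are assumptions made for this sketch.
BASIS = [
    lambda t: 1.0 - math.exp(-5.0 * t),        # exponential saturation
    lambda t: 1.0 - 1.0 / (1.0 + 10.0 * t),    # power-law-like saturation
]


def sample_curve(pi_config, lam, n_steps, rng):
    """Sample one noisy learning curve for hyperparameter setting `lam`."""
    raw = pi_config(lam)                       # K basis weights + 1 noise parameter
    k = len(BASIS)
    weights = [sigmoid(r) / k for r in raw[:k]]  # each in (0, 1/K): mean stays in (0, 1)
    sigma = 0.02 * sigmoid(raw[k])             # lambda-specific noise std (assumed scale)
    curve = []
    for step in range(1, n_steps + 1):
        t = step / n_steps
        mean = sum(w * f(t) for w, f in zip(weights, BASIS))
        curve.append(mean + rng.gauss(0.0, sigma))
    return curve


rng = random.Random(0)
# One pi_config per synthetic task, shared by all configurations on that task.
pi_config = make_pi_config(n_hps=3, n_hidden=8, n_out=len(BASIS) + 1, rng=rng)
configs = [[rng.random() for _ in range(3)] for _ in range(5)]
task = [sample_curve(pi_config, lam, n_steps=50, rng=rng) for lam in configs]
```

Because all curves in `task` share one `pi_config`, nearby hyperparameter settings produce similar basis weights and therefore similar curves, which mirrors the correlations noted in the Figure 5 caption.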