Freeze-Thaw Bayesian Optimization

Kevin Swersky, Jasper Snoek, Ryan Prescott Adams

TL;DR

The paper tackles efficient hyperparameter search by leveraging partial progress during iterative training to pause, resume, or spawn new trials. It introduces a nonparametric kernel for training curves based on an infinite mixture of exponentially decaying bases, enabling accurate forecasting of final performance from early observations. It also proposes a scalable spatiotemporal Gaussian process prior that models a global mean across hyperparameters and independent per-curve GPs, enabling efficient inference via Woodbury identities. An information-theoretic framework, specifically entropy search, guides when to freeze, thaw, or initialize models by maximizing expected information about the asymptotic minimum. Empirical results on logistic regression, online LDA, and PMF show substantial speedups over previous BO methods, confirming the practicality of dynamic training management.

Abstract

In this paper we develop a dynamic form of Bayesian optimization for machine learning models with the goal of rapidly finding good hyperparameter settings. Our method uses the partial information gained during the training of a machine learning model in order to decide whether to pause training and start a new model, or resume the training of a previously-considered model. We specifically tailor our method to machine learning problems by developing a novel positive-definite covariance kernel to capture a variety of training curves. Furthermore, we develop a Gaussian process prior that scales gracefully with additional temporal observations. Finally, we provide an information-theoretic framework to automate the decision process. Experiments on several common machine learning models show that our approach is extremely effective in practice.
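As a brief worked derivation of the kernel described above: the sketch below assumes the mixing measure over decay rates $\lambda$ is a gamma density with shape $\alpha$ and rate $\beta$ (the two parameters named in the Figure 1 caption). Under that assumption, the infinite mixture of exponentially decaying bases integrates to a closed form.

```latex
\begin{aligned}
k(t, t') &= \int_0^\infty e^{-\lambda t}\, e^{-\lambda t'}\,
            \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, \lambda^{\alpha-1} e^{-\lambda\beta}\, d\lambda \\
         &= \frac{\beta^{\alpha}}{\Gamma(\alpha)} \int_0^\infty
            \lambda^{\alpha-1} e^{-\lambda (t + t' + \beta)}\, d\lambda \\
         &= \frac{\beta^{\alpha}}{(t + t' + \beta)^{\alpha}},
            \qquad t, t' \ge 0,\ \alpha, \beta > 0 .
\end{aligned}
```

The last step uses the gamma integral $\int_0^\infty \lambda^{\alpha-1} e^{-c\lambda}\, d\lambda = \Gamma(\alpha)/c^{\alpha}$ with $c = t + t' + \beta$. Because the kernel is a nonnegative mixture of products $e^{-\lambda t} e^{-\lambda t'}$, it is positive semi-definite by construction.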

Paper Structure

This paper contains 25 sections, 23 equations, 4 figures, 1 algorithm.

Figures (4)

  • Figure 1: Example functions from the exponential decay kernel. (a) Example functions from our basis set with $\alpha=1.0$ and $\beta=0.5$. (b) Samples from a Gaussian process with this covariance function. (c) Samples from a Gaussian process conditioned on the curves starting at a positive number and with an added Ornstein-Uhlenbeck kernel to simulate natural training curves.
  • Figure 2: (a) Factor graph representation of the GP model for training procedures. Each row represents a learning curve that is drawn from an independent GP prior, conditioned on its mean. The mean of each learning curve is jointly drawn with the mean of the other curves using another GP prior. (b) Partially completed training curves, and the GP prediction at their eventual asymptotes. (c) The posterior GP prediction at the asymptote. Each colored point represents the GP prediction at the hyperparameter location corresponding to a training curve with the same color.
  • Figure 3: This figure shows the results of the empirical comparison to standard Bayesian optimization on three common machine learning hyperparameter optimization problems. For each problem we report the lowest loss observed over all training epochs evaluated by each method, averaged over five optimization runs.
  • Figure 4: A visualisation of the progression of the optimization curves throughout the optimization procedure on the probabilistic matrix factorization problem. Panel (a) shows, for each distinct hyperparameter setting evaluated, the optimization curve run out by the procedure. The curves are ordered by the Bayesian optimization iteration in which they were started, and each epoch of each curve is colored by the Bayesian optimization iteration in which that section was evaluated. The figure shows that the procedure frequently stops running training curves that are not promising and often returns to promising training curves that were previously started. Panel (b) shows a two-dimensional cross section of panel (a).
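To make the Figure 1 caption concrete, here is a minimal sketch of drawing sample functions from a zero-mean Gaussian process with the exponential decay kernel, using the caption's values $\alpha=1.0$ and $\beta=0.5$. It assumes the closed form $k(t,t') = \beta^{\alpha}/(t+t'+\beta)^{\alpha}$. The function names, the time grid, and the jitter value are illustrative choices, not taken from the paper.

```python
import numpy as np

def exp_decay_kernel(t, t_prime, alpha=1.0, beta=0.5):
    """Assumed closed form: k(t, t') = beta^alpha / (t + t' + beta)^alpha."""
    t = np.asarray(t, dtype=float)[:, None]
    t_prime = np.asarray(t_prime, dtype=float)[None, :]
    return beta**alpha / (t + t_prime + beta) ** alpha

def sample_curves(times, n_samples=3, alpha=1.0, beta=0.5, jitter=1e-6, seed=0):
    """Draw zero-mean GP samples on `times` via a Cholesky factor.

    The jitter term is a numerical safeguard (an illustrative choice):
    this kernel matrix is badly conditioned on a dense grid.
    """
    rng = np.random.default_rng(seed)
    K = exp_decay_kernel(times, times, alpha, beta)
    K = K + jitter * np.eye(len(times))
    L = np.linalg.cholesky(K)
    z = rng.standard_normal((len(times), n_samples))
    return (L @ z).T  # shape: (n_samples, len(times))

# Hypothetical grid of training times (e.g. epochs).
times = np.linspace(0.0, 10.0, 50)
K = exp_decay_kernel(times, times)
curves = sample_curves(times)
```

Each row of `curves` is one sampled function. The prior variance `K[i, i]` shrinks as time grows, so samples flatten out toward zero at large `times`, which is the qualitative behavior the kernel is designed to capture. Conditioning on a positive starting value and adding an Ornstein-Uhlenbeck term, as in Figure 1(c), is not included here.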