Pre-trained Gaussian Processes for Bayesian Optimization

Zi Wang, George E. Dahl, Kevin Swersky, Chansoo Lee, Zachary Nado, Justin Gilmer, Jasper Snoek, Zoubin Ghahramani

TL;DR

HyperBO is shown theoretically to have bounded posterior predictions and near-zero regrets without assuming the "ground truth" GP prior is known. Empirically, it locates good hyperparameters on average at least 3 times more efficiently than the best competing methods on both the authors' new tuning dataset and existing multi-task BO benchmarks.

Abstract

Bayesian optimization (BO) has become a popular strategy for global optimization of expensive real-world functions. Contrary to a common expectation that BO is suited to optimizing black-box functions, it actually requires domain knowledge about those functions to deploy BO successfully. Such domain knowledge often manifests in Gaussian process (GP) priors that specify initial beliefs on functions. However, even with expert knowledge, it is non-trivial to quantitatively define a prior. This is especially true for hyperparameter tuning problems on complex machine learning models, where landscapes of tuning objectives are often difficult to comprehend. We seek an alternative practice for setting these functional priors. In particular, we consider the scenario where we have data from similar functions that allow us to pre-train a tighter distribution a priori. We detail what pre-training entails for GPs using a KL divergence based loss function, and propose a new pre-training based BO framework named HyperBO. Theoretically, we show bounded posterior predictions and near-zero regrets for HyperBO without assuming the "ground truth" GP prior is known. To verify our approach in realistic setups, we collect a large multi-task hyperparameter tuning dataset by training tens of thousands of configurations of near-state-of-the-art deep learning models on popular image and text datasets, as well as a protein sequence dataset. Our results show that on average, HyperBO is able to locate good hyperparameters at least 3 times more efficiently than the best competing methods on both our new tuning dataset and existing multi-task BO benchmarks.
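
To make the idea of a KL divergence based loss concrete, the following is a minimal sketch of the closed-form KL divergence between two multivariate Gaussians, which is the kind of quantity such a loss is built from once a GP is restricted to a finite set of inputs. The function names `gaussian_kl` and `empirical_gaussian`, and the choice of an unbiased sample covariance, are assumptions made for illustration; the abstract does not spell out the exact loss used by HyperBO.

```python
import numpy as np


def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ) for full-rank covariance matrices."""
    d = mu0.shape[0]
    # Use linear solves against cov1 instead of forming its inverse explicitly.
    trace_term = np.trace(np.linalg.solve(cov1, cov0))
    diff = mu1 - mu0
    maha_term = diff @ np.linalg.solve(cov1, diff)
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (trace_term + maha_term - d + logdet1 - logdet0)


def empirical_gaussian(Y):
    """Sample mean and covariance of an (M x N) matrix Y.

    Rows index the M shared inputs and columns index the N training functions.
    Assumption: the unbiased sample covariance is used, and N > M so that the
    estimate is full rank.
    """
    return Y.mean(axis=1), np.cov(Y)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Y = rng.standard_normal((3, 200))        # 3 inputs, 200 training functions
    mu_hat, cov_hat = empirical_gaussian(Y)
    # Divergence between the empirical Gaussian and a candidate model Gaussian.
    print(gaussian_kl(mu_hat, cov_hat, np.zeros(3), np.eye(3)))
```

In a pre-training loop, the second pair of arguments would come from the GP being trained (its mean function and kernel evaluated at the shared inputs), and the divergence would be minimized over the GP's parameters.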

Paper Structure

This paper contains 58 sections, 14 theorems, 100 equations, 29 figures, 5 tables, 1 algorithm.

Key Result

Proposition 4

Let $\boldsymbol{x}=[x_j]_{j=1}^M\in\mathbb{R}^{M\times d}$ and $Y=[y^{(i)}_j]_{j\in[M],\, i\in[N]}\in \mathbb{R}^{M\times N}$. Given Assumption asp:iid, asp:noise and asp:finite, we have such that $\hat{\mu}, \hat{k}\circ\hat{\sigma}^2 = \underset{\mu, k\circ\sigma^2}{\mathrm{arg\,min}} \;\mathcal{L}^{\text{EKL}}(\mu, k\circ\sigma^2) = \underset{\mu, k\circ\sigma^2}{\mathrm{arg\,min}} \;\cdots$

Figures (29)

  • Figure 1: During pre-training, we optimize a Gaussian process (GP) such that it can gradually generate functions (illustrated as grey dotted lines) that are similar to the training functions. The similarity manifests in individual function values and correlations between function values indicated by smoothness and wiggliness. The blue line illustrates the mean function of the GP and the shaded areas are the $99\%$ and $95\%$ confidence intervals. For an unknown test function, we can derive a posterior conditioned on observed datapoints (illustrated as black dots) and the pre-trained GP prior. Compared to a GP fit to observations without pre-training, the pre-trained GP posterior captures the test function much better, which is a critical prerequisite for Bayesian optimization.
  • Figure 2: The generating process of our training data: for each $i\in [N]$, $f_i \sim \mathcal{GP}(\mu^*, k^*)$, and for each $j\in[M_i]$, $y^{(i)}_{j}\sim {\mathcal{N}}\left(f_i(x^{(i)}_{j}), \sigma_*^2 \right)$, where mean function $\mu^*$, kernel function $k^*$ and noise variance $\sigma_*^2$ are unknown.
  • Figure 3: We define a ground truth GP that has a zero mean and a Matérn 5/2 kernel with amplitude $1.0$ and lengthscale $1.0$ on a 1-dimensional domain. We then sample 3 i.i.d. training functions from the GP and their evaluations on 5 inputs. Those 5 inputs are sampled i.i.d. from the standard normal distribution. The model is a GP with a constant mean function parameterized by a constant, and a squared exponential kernel parameterized by a lengthscale and an amplitude value. The figures visualize EKL and NLL (scaled by $\frac{\min \mathrm{EKL}}{\min \mathrm{NLL}}$ to allow consistency on scale) over each parameter with the other two fixed. In this setting, EKL and NLL have different landscapes and different $\mathrm{arg\,min}$ locations.
  • Figure 4: Top left shows 10 training functions, each with one color, sampled from a ground truth GP (a multivariate Gaussian) on a finite domain $\mathfrak X = \{1,2,3,4,5\}$. The top middle plot shows probability densities of the marginal Gaussian distribution for each function value evaluated with the ground truth, MLE estimate, unbiased estimate and pre-trained GP. The following plots show their conditional distributions (i.e., posteriors). With increased observations (black dots), estimate-based posterior predictions become less accurate despite close estimations of the prior.
  • Figure 5: We used the same setup as Figure 4, except that the size of the training dataset is $N=50$. With more training functions, we obtain more accurate pre-trained GP posterior predictions.
  • ...and 24 more figures
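
The generating process in Figure 2 and the toy setup in Figure 3 can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' code: the helper names (`matern52`, `sample_training_data`, `gp_nll`), the default noise variance of `0.01`, the small jitter term, and the convention that the amplitude multiplies the kernel directly are all hypothetical choices that the captions do not specify.

```python
import numpy as np


def matern52(x1, x2, amplitude=1.0, lengthscale=1.0):
    """Matern 5/2 kernel matrix between 1-D input arrays x1 and x2."""
    r = np.abs(x1[:, None] - x2[None, :]) / lengthscale
    s = np.sqrt(5.0) * r
    return amplitude * (1.0 + s + s**2 / 3.0) * np.exp(-s)


def sample_training_data(num_functions=3, num_inputs=5,
                         noise_variance=0.01, seed=0):
    """Draw f_i ~ GP(0, Matern 5/2) and y_j^(i) ~ N(f_i(x_j), noise_variance)."""
    rng = np.random.default_rng(seed)
    # Inputs shared by all functions, sampled from a standard normal (Figure 3).
    x = rng.standard_normal(num_inputs)
    # Jitter (an assumption) keeps the Cholesky factorization numerically stable.
    cov = matern52(x, x) + 1e-8 * np.eye(num_inputs)
    chol = np.linalg.cholesky(cov)
    # Each column is one latent function evaluated at the shared inputs.
    f = chol @ rng.standard_normal((num_inputs, num_functions))
    # Independent Gaussian observation noise (Figure 2).
    y = f + np.sqrt(noise_variance) * rng.standard_normal(f.shape)
    return x, y


def gp_nll(x, y, constant_mean, amplitude, lengthscale, noise_variance):
    """Summed negative log marginal likelihood of the columns of y under a GP
    with a constant mean and a squared exponential kernel."""
    m, n = y.shape
    sq_dist = (x[:, None] - x[None, :]) ** 2
    cov = amplitude * np.exp(-0.5 * sq_dist / lengthscale**2)
    cov = cov + noise_variance * np.eye(m)
    chol = np.linalg.cholesky(cov)
    whitened = np.linalg.solve(chol, y - constant_mean)
    quad = np.sum(whitened**2)
    logdet = 2.0 * np.sum(np.log(np.diag(chol)))
    return 0.5 * (quad + n * logdet + n * m * np.log(2.0 * np.pi))


if __name__ == "__main__":
    x, y = sample_training_data()
    # Scan the lengthscale with the other two parameters fixed, as in Figure 3.
    for lengthscale in (0.5, 1.0, 2.0):
        print(lengthscale, gp_nll(x, y, 0.0, 1.0, lengthscale, 0.01))
```

Scanning one model parameter while holding the other two fixed, as in the final loop, mirrors how the Figure 3 panels are described; an EKL curve would be computed over the same grid for comparison.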

Theorems & Definitions (15)

  • Example 4.1
  • Proposition 4
  • Theorem 5
  • Theorem 6
  • Lemma 8
  • Lemma 9
  • Lemma 10
  • Corollary 11: Bernoulli's inequality
  • Lemma 12
  • Lemma 13: Lemma 5.3 of Srinivas et al. (2009)
  • ...and 5 more