A Study of Bayesian Neural Network Surrogates for Bayesian Optimization

Yucen Lily Li; Tim G. J. Rudner; Andrew Gordon Wilson

TL;DR

This work evaluates Bayesian neural network (BNN) surrogates as alternatives to standard Gaussian process (GP) models for Bayesian optimization, across a wide range of synthetic and real-world problems. It covers fully stochastic finite-width BNNs with inference methods such as Hamiltonian Monte Carlo (HMC) and stochastic gradient HMC, deep ensembles, deep kernel learning, linearized Laplace approximations, and infinite-width BNNs, with attention to non-stationary objectives and high-dimensional inputs. Key findings: HMC generally yields the strongest performance among fully stochastic BNNs; deep kernel learning is often competitive with GP baselines; deep ensembles tend to underperform; infinite-width BNNs are particularly strong in high-dimensional settings; and no single surrogate dominates across all tasks, which argues for keeping a diverse set of surrogates available. The study highlights the importance of non-Euclidean representations and problem-specific inductive biases, and provides a reproducible framework with public code to guide future surrogate selection in Bayesian optimization.

Abstract

Bayesian optimization is a highly efficient approach to optimizing objective functions which are expensive to query. These objectives are typically represented by Gaussian process (GP) surrogate models which are easy to optimize and support exact inference. While standard GP surrogates have been well-established in Bayesian optimization, Bayesian neural networks (BNNs) have recently become practical function approximators, with many benefits over standard GPs such as the ability to naturally handle non-stationarity and learn representations for high-dimensional data. In this paper, we study BNNs as alternatives to standard GP surrogates for optimization. We consider a variety of approximate inference procedures for finite-width BNNs, including high-quality Hamiltonian Monte Carlo, low-cost stochastic MCMC, and heuristics such as deep ensembles. We also consider infinite-width BNNs, linearized Laplace approximations, and partially stochastic models such as deep kernel learning. We evaluate this collection of surrogate models on diverse problems with varying dimensionality, number of objectives, non-stationarity, and discrete and continuous inputs. We find: (i) the ranking of methods is highly problem dependent, suggesting the need for tailored inductive biases; (ii) HMC is the most successful approximate inference procedure for fully stochastic BNNs; (iii) full stochasticity may be unnecessary as deep kernel learning is relatively competitive; (iv) deep ensembles perform relatively poorly; (v) infinite-width BNNs are particularly promising, especially in high dimensions.
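
For readers new to the setup, the loop that every surrogate in the study plugs into can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a one-dimensional maximization problem, a zero-mean GP surrogate with a fixed squared-exponential kernel, an expected-improvement acquisition function, and a fixed candidate grid. The function names (`rbf_kernel`, `gp_posterior`, `expected_improvement`, `bayes_opt`) and all hyperparameter values are hypothetical choices made for this example.

```python
import math
import numpy as np


def rbf_kernel(a, b, lengthscale=0.2, variance=1.0):
    """Squared-exponential kernel between inputs a (n, d) and b (m, d)."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)


def gp_posterior(x_train, y_train, x_test, noise=1e-4):
    """Exact zero-mean GP posterior mean and standard deviation at x_test."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_ts = rbf_kernel(x_train, x_test)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_ts.T @ alpha
    v = np.linalg.solve(chol, k_ts)
    var = rbf_kernel(x_test, x_test).diagonal() - (v**2).sum(0)
    return mean, np.sqrt(np.clip(var, 1e-12, None))


def expected_improvement(mean, std, best):
    """Expected improvement over the incumbent `best` (maximization)."""
    z = (mean - best) / std
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    return (mean - best) * cdf + std * pdf


def bayes_opt(objective, n_init=3, n_iter=15, seed=0):
    """Query `objective` on [0, 1], choosing each new point by maximizing EI."""
    rng = np.random.default_rng(seed)
    candidates = np.linspace(0.0, 1.0, 201)[:, None]  # assumed search grid
    x = rng.uniform(0.0, 1.0, size=(n_init, 1))
    y = objective(x[:, 0])
    for _ in range(n_iter):
        # Fit the surrogate to all data so far, then score every candidate.
        mean, std = gp_posterior(x, y, candidates)
        ei = expected_improvement(mean, std, y.max())
        x_next = candidates[np.argmax(ei)]
        x = np.vstack([x, x_next[None, :]])
        y = np.append(y, objective(x_next[0]))
    return x, y


if __name__ == "__main__":
    xs, ys = bayes_opt(lambda t: -(t - 0.3) ** 2)
    print("best input:", xs[np.argmax(ys), 0], "best value:", ys.max())
```

Swapping the surrogate amounts to replacing `gp_posterior` with any model that returns a predictive mean and standard deviation at the candidate points, for example moments computed from BNN posterior samples.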

Paper Structure (61 sections, 25 figures, 2 tables)

Introduction
Related Work
Surrogate Models
Gaussian Processes.
Fully Stochastic Finite-Width Bayesian Neural Networks.
Deep Kernel Learning.
Linearized Laplace Approximation.
Infinite-Width Bayesian Neural Networks.
Role of Architecture
Model Hyperparameters.
Network Width and Depth.
Activation Function.
Empirical Evaluation
Synthetic Benchmarks
Real-World Benchmarks
...and 46 more sections

Figures (25)

Figure 1: The design of the BNN has a significant impact on the uncertainty estimates. We visualize the uncertainty estimates and function draws produced by full-batch HMC on a simple toy objective function with four function queries (denoted in black). For the visualizations above, we fix all other design choices with the following base parameters: likelihood variance $= 1$, prior variance $= 1$, number of hidden layers $= 3$, and width $= 128$. We see that varying the different aspects of the model leads to significantly different posterior predictive distributions.
Figure 2: There is no single architecture for HMC that performs the best across all problems. We compare the impact of the design on the Bayesian optimization performance for different benchmark problems. For each set of experiments, we fix all other aspects of the design and plot the values of the maximum reward found using HMC after 100 function evaluations over 10 trials.
Figure 3: BNNs are often comparable to GPs on standard synthetic benchmarks. However, the type of BNN used has a big impact: HMC typically outperforms other BNN approximation methods, while SGHMC and deep ensembles seem to have less reliable performance and are often unable to effectively find the maximum. LLA also has poor performance across the single-objective problems. For each benchmark function, we include $d$ for the number of input dimensions, and $o$ for the number of objectives. We plot the mean and one standard error of the mean over 10 trials.
Figure 4: Real-world benchmarks show mixed results. BNNs outperform GPs on some problems and underperform on others, and there does not seem to be a noticeable preference for any particular surrogate as we increase the number of input dimensions. Additionally, there does not appear to be a clear separation between the top row of experiments, which optimize over continuous parameters, and the bottom row of experiments, which also include some discrete inputs. For each benchmark, we include $d$ for the number of input dimensions, and $o$ for the number of objectives. We plot the mean and one standard error of the mean over 10 trials.
Figure 5: I-BNNs outperform other surrogates in many high-dimensional settings. We show the results of maximizing a polynomial function (left), maximizing a fixed function draw from a neural network (center), and optimizing the parameters of a neural network in the context of knowledge distillation (right). All of these objectives are high-dimensional and non-stationary, and we find that BNNs consistently find higher rewards than GPs across all problems. We plot the mean and one standard error of the mean over 10 trials, and $d$ corresponds to the number of input dimensions.
...and 20 more figures
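
Several captions above report "the mean and one standard error of the mean over 10 trials". As a rough sketch of how such curves can be computed, the snippet below takes a matrix of observed rewards, forms the best-so-far reward per trial, and then averages across trials. The function name `best_so_far_summary`, the array layout, and the use of a running maximum are assumptions made for illustration; the captions do not specify these details.

```python
import numpy as np


def best_so_far_summary(rewards):
    """Summarize optimization traces across trials.

    rewards: array-like of shape (n_trials, n_evals), one row per trial,
    holding the objective value observed at each function evaluation.
    Returns the per-evaluation mean and standard error of the mean of the
    running maximum.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Running maximum along the evaluation axis: best reward found so far.
    best = np.maximum.accumulate(rewards, axis=1)
    mean = best.mean(axis=0)
    # Sample standard deviation (ddof=1) divided by sqrt(number of trials).
    sem = best.std(axis=0, ddof=1) / np.sqrt(best.shape[0])
    return mean, sem


if __name__ == "__main__":
    mean, sem = best_so_far_summary([[1.0, 3.0, 2.0], [2.0, 1.0, 4.0]])
    print(mean, sem)
```

A shaded band of `mean ± sem` around the mean curve then corresponds to the "one standard error" region described in the captions.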
