The Behavior and Convergence of Local Bayesian Optimization

Kaiwen Wu; Kyurae Kim; Roman Garnett; Jacob R. Gardner

TL;DR

The behavior of the local approach is studied, and it is found that the statistics of individual local solutions of Gaussian process sample paths are surprisingly good compared to what one would expect to recover from global methods.

Abstract

A recent development in Bayesian optimization is the use of local optimization strategies, which can deliver strong empirical performance on high-dimensional problems compared to traditional global strategies. The "folk wisdom" in the literature is that the focus on local optimization sidesteps the curse of dimensionality; however, little is known concretely about the expected behavior or convergence of Bayesian local optimization routines. We first study the behavior of the local approach, and find that the statistics of individual local solutions of Gaussian process sample paths are surprisingly good compared to what we would expect to recover from global methods. We then present the first rigorous analysis of such a Bayesian local optimization algorithm recently proposed by Müller et al. (2021), and derive convergence rates in both the noisy and noiseless settings.

Paper Structure (29 sections, 34 theorems, 115 equations, 5 figures, 1 algorithm)

Introduction
Background and Related Work
Bayesian optimization.
Gaussian processes.
Existing bounds for global BO.
Gaussian process derivatives.
How Good Are Local Solutions?
A Local Bayesian Optimization Algorithm
Convergence Results
Convergence in the Noiseless Setting
Convergence in the Noisy Setting
Challenges.
Additional Experiments
How loose are our convergence rates?
What is the effect of multiple restarts?
...and 14 more sections

Key Result

Lemma 1

For any $f \in \mathcal{H}$, any $\mathbf{x} \in \mathcal{X}$ and any $\mathcal{D}$, we have the following inequality

Figures (5)

Figure 1: The unreasonable effectiveness of locally optimizing GP sample paths. (Top row): Distributions of local solutions found when locally optimizing GP sample paths in various numbers of dimensions, with varying amounts of noise. (Bottom left): The minimum sample complexity of grid search required to achieve the median value found by GIBO ($\sigma = 0$) in expectation. (Bottom middle, right): The performance of global optimization algorithms GP-UCB and random search in this setting. See §\ref{sec:how_good} for details.
Figure 2: Comparison of the error function \ref{eq:error-function} for the $\nu = 2.5$ Matérn kernel with our upper bound in \ref{thm:bound-error-function-matern}. The error function $E_{d, k, \sigma}(b)$ is approximated by minimizing \ref{eq:error-function} with L-BFGS. Both plots are in log-log scale. Left: The slope indicates the exponent on $b$. Since the slope magnitude of the error function is slightly larger than that of the bound, the error function might decrease slightly faster than $\mathcal{O}(b^{-\frac{1}{2}})$ asymptotically. Right: The slope indicates the exponent on $d$. Since all lines have roughly the same slope, the dependency on the dimension in \ref{thm:bound-error-function-matern} seems to be tight.
Figure 3: The performance of random restart on a GP sample path in 100 dimensions. Left: a density plot for the minimum value found on a single restart (compare with Figure \ref{fig:boxplots}). Right: the median and a 90% confidence interval for the best value found after a given number of random restarts.
Figure 4: Estimating the "derivative" at $\mathbf{x} = (0, 1)$ with a Matérn Gaussian process ($\nu = 2.5$) in three different settings. Left: $f(\mathbf{x}) = \frac{1}{2} \lVert\mathbf{x}\rVert^2$. With $n = 5$ samples, the posterior mean gradient is close to the ground truth. Middle: $f(\mathbf{x}) = \lVert\mathbf{x}\rVert_1$. The $\ell_1$ norm is not differentiable at $(0, 1)$. With exactly the same samples as the left panel, the posterior mean gradient has higher error. Right: $f(\mathbf{x}) = \lVert\mathbf{x}\rVert_1$. Increasing the sample size to $n = 10$ decreases the estimation error.
Figure 5: Estimating the "derivative" of ReLU at $x = 0$ with noisy observations ($\sigma = 0.01$).
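
The posterior mean gradient discussed in the Figure 4 caption can be illustrated with a minimal NumPy sketch. It is not the authors' code. It assumes a zero-mean GP with a $\nu = 2.5$ Matérn kernel, a unit length scale, a small jitter term standing in for the noise variance, and $n = 5$ sample locations drawn at random near $(0, 1)$. All function names, the length scale, the jitter, and the sampling scheme are hypothetical choices made for illustration.

```python
import numpy as np

def matern52(A, B, ell=1.0):
    """Matern nu=2.5 kernel matrix between the rows of A (m, d) and B (n, d)."""
    r = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    a = np.sqrt(5.0) / ell
    return (1.0 + a * r + (a * r) ** 2 / 3.0) * np.exp(-a * r)

def matern52_grad(x, X, ell=1.0):
    """Gradient of k(x, X_i) with respect to x; returns an (n, d) array.

    For this kernel, d/dx k(x, x') = -(a^2 / 3) (1 + a r) exp(-a r) (x - x'),
    with a = sqrt(5) / ell and r = ||x - x'||, which is well defined at r = 0.
    """
    diff = x[None, :] - X
    r = np.linalg.norm(diff, axis=-1)
    a = np.sqrt(5.0) / ell
    coef = -(a ** 2 / 3.0) * (1.0 + a * r) * np.exp(-a * r)
    return coef[:, None] * diff

def posterior_mean_gradient(x, X, y, ell=1.0, noise=1e-6):
    """Gradient of the zero-mean GP posterior mean at x.

    `noise` is an assumed jitter / noise variance added to the diagonal.
    """
    K = matern52(X, X, ell) + noise * np.eye(len(X))
    alpha = np.linalg.solve(K, y)          # (K + noise * I)^{-1} y
    return matern52_grad(x, X, ell).T @ alpha

# Illustrative setup (assumed, not taken from the paper's experiments):
# five samples of f(x) = 0.5 * ||x||^2 scattered around x0 = (0, 1).
rng = np.random.default_rng(0)
x0 = np.array([0.0, 1.0])
X = x0 + 0.3 * rng.standard_normal((5, 2))
y = 0.5 * np.sum(X ** 2, axis=1)

g = posterior_mean_gradient(x0, X, y)
print("posterior mean gradient:", g)
print("true gradient of f at x0:", x0)  # gradient of 0.5 * ||x||^2 is x
```

How close the estimate is to the true gradient depends on the sample locations, the length scale, and the zero prior mean, so the printed values are only indicative.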

Theorems & Definitions (37)

Remark 1: e.g., devroye2001combinatorial, kamath2015bounds
Definition 1: Smoothness
Definition 2: Error function
Lemma 1
Lemma 2
Theorem 1
Corollary 1
Lemma 3
Theorem 2
Lemma 4: RBF Kernel
...and 27 more
