Vanilla Bayesian Optimization Performs Great in High Dimensions

Carl Hvarfner; Erik Orm Hellsten; Luigi Nardi

TL;DR

The paper challenges the long-standing claim that Bayesian optimization (BO) struggles in high dimensions because of excessive model complexity. It shows that vanilla BO performs poorly when the Gaussian process (GP) lengthscale prior lets model complexity grow with the dimensionality $D$, and proposes a simple fix: scale the lengthscale prior with the dimension, e.g. $\ell_i \sim \text{LogNormal}(\mu_0 + \tfrac{\log D}{2}, \sigma_0)$, so that correlations between observations stay meaningful as $D$ grows. With this plug-in change, the authors demonstrate that vanilla BO drastically outperforms state-of-the-art high-dimensional BO (HDBO) methods on multiple real-world tasks, including problems with thousands of dimensions. The result is a practical, scalable, and general approach that broadens the applicability of GP-based BO without imposing strong structure on the objective. The authors still acknowledge that specialized HDBO methods may win when the problem structure aligns with their assumptions.
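The scaled prior above can be illustrated with a minimal standard-library sketch. The values of `MU0` and `SIGMA0` are hypothetical placeholders rather than the paper's hyperparameters, and the helper names are invented for this example. It samples lengthscales from $\text{LogNormal}(\mu_0 + \tfrac{\log D}{2}, \sigma_0)$ and checks that the typical distance between random points in $[0, 1]^D$, measured in units of the prior median lengthscale, stays roughly constant as $D$ grows.

```python
import math
import random


def sample_lengthscales(dim, mu0, sigma0, rng):
    """Draw one lengthscale per input dimension from LogNormal(mu0 + log(dim)/2, sigma0)."""
    mu = mu0 + 0.5 * math.log(dim)
    return [rng.lognormvariate(mu, sigma0) for _ in range(dim)]


def prior_median(dim, mu0):
    """Median of the dimension-scaled prior, which equals exp(mu0) * sqrt(dim)."""
    return math.exp(mu0 + 0.5 * math.log(dim))


def mean_scaled_distance(dim, lengthscale, n_pairs, rng):
    """Average of ||x - x'|| / lengthscale over random pairs drawn uniformly in [0, 1]^dim."""
    total = 0.0
    for _ in range(n_pairs):
        sq_dist = sum((rng.random() - rng.random()) ** 2 for _ in range(dim))
        total += math.sqrt(sq_dist) / lengthscale
    return total / n_pairs


# Hypothetical placeholder hyperparameters, chosen only for illustration.
MU0, SIGMA0 = 0.0, 1.0

rng = random.Random(0)
for dim in (4, 64, 1024):
    draws = sample_lengthscales(dim, MU0, SIGMA0, rng)
    ell = prior_median(dim, MU0)
    ratio = mean_scaled_distance(dim, ell, 200, rng)
    print(f"D={dim:5d}  median lengthscale={ell:7.2f}  "
          f"smallest draw={min(draws):8.3f}  mean distance / lengthscale={ratio:.3f}")
```

Because the expected squared distance between two uniform points in $[0, 1]^D$ is $D/6$, distances grow like $\sqrt{D}$, and the $\tfrac{\log D}{2}$ shift in the prior mean grows the median lengthscale at the same rate. With an unscaled prior the same ratio would instead grow with $\sqrt{D}$, pushing correlations toward zero.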

Abstract

High-dimensional problems have long been considered the Achilles' heel of Bayesian optimization algorithms. Spurred by the curse of dimensionality, a large collection of algorithms aim to make it more performant in this setting, commonly by imposing various simplifying assumptions on the objective. In this paper, we identify the degeneracies that make vanilla Bayesian optimization poorly suited to high-dimensional tasks, and further show how existing algorithms address these degeneracies through the lens of lowering the model complexity. Moreover, we propose an enhancement to the prior assumptions that are typical to vanilla Bayesian optimization algorithms, which reduces the complexity to manageable levels without imposing structural restrictions on the objective. Our modification - a simple scaling of the Gaussian process lengthscale prior with the dimensionality - reveals that standard Bayesian optimization works drastically better than previously thought in high dimensions, clearly outperforming existing state-of-the-art algorithms on multiple commonly considered real-world high-dimensional tasks.

Paper Structure (43 sections, 2 theorems, 17 equations, 22 figures, 2 tables)

Introduction
Background
Gaussian Processes
Bayesian Optimization
A Working Definition of "Vanilla" BO
The Maximal Information Gain
Related Work
Low-dimensional active subspaces
Additive kernels
Local Bayesian optimization
Non-Euclidean kernels
Pitfalls of High-Complexity Assumptions
Complexity and Dimensionality
The Boundary Issue Revisited
Complexity of Existing HDBO
...and 28 more sections

Key Result

Proposition 4.1

Assume that $y_{max} > c$, $\mathbf{K} = \sigma_f^2 \bm{I}$, and that the candidate query $\bm{x}_*$ correlates with at most one observation. Then, the correlation $\rho^* = \sigma_f^{-2}k(\bm{x}_*, \bm{x}_{inc})$ between the next query $\bm{x}_* =\mathop{\mathrm{arg\,max}}_{\bm{x}\in \dots}$ ...

Figures (22)

Figure 1: Three models (green, blue, red) with varying lengthscales, and thus varying complexity, attempting to model the same objective, acquiring data by greedily maximizing the information gain (IG). The maximal information gain (MIG) is shown for the three models as well as for an independent kernel (dashed black), where the matrix $\mathbf{K} = \bm{I}$. The MIG for the complex model closely follows the independent kernel for 20 samples, suggesting that the complex model can acquire 20 data points of approximately maximal variance. The vertical line in the MIG plot indicates the current iteration.
Figure 2: Complexity scaling in the number of data points for varying dimensionalities of the problem for vanilla BO with a lengthscale of $\bm{\ell} = 0.5$. For $D=18$, the complexity visually differs from that of an independent kernel after approximately 3000 data points. For $D=24$, 5000 data points are not sufficient to break the independence between observations. The MIG is approximated by sampling evenly distributed data using a Sobol sequence.
Figure 3: Lower bound on the optimal correlation $\rho^*$ between the incumbent and the upcoming query. (a) The GP for two almost-independent observations with a large exploratory region. EI prefers to query close to the incumbent, well within the bound on $\rho^*$ from Prop. 4.1. (b) Tightness of the bound compared to a numerical solution for the optimal correlation, for various values of $y_{max}$.
Figure 4: We display the model complexity scaling in the dimensionality of the problem for 1000 data points for various HDBO algorithms. Vanilla BO with fixed lengthscales (magenta) approaches independent complexity at approximately 20 dimensions. As expected, REMBO random embeddings (brown) reduce complexity the most, followed by BOCK cylindrical kernels (yellow). The MIG growth of our proposed modification of the global GP (blue) flattens out at a rate similar to cylindrical kernels (yellow), despite modelling the original, full-dimensional space.
Figure 5: Average log regret of all baselines on Levy (4D) and Hartmann (6D) synthetic test functions of varying ambient dimensionality across 20 repetitions (10 for SAASBO). Vanilla BO performs second best, beaten only by SAASBO on four tasks, whose axis-aligned subspace assumption (along with MCTS-VS' variable selection) aligns perfectly with the task at hand. We omit SAASBO from the 1000D benchmarks due to the prohibitive runtime, and RD-UCB and MPD due to a combination of runtime and numerical instability.
...and 17 more figures
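
The complexity comparison described in the captions of Figures 1 and 2 can be sketched as follows. This is an illustrative standard-library approximation, not the authors' code: it assumes the usual GP information gain $\tfrac{1}{2}\log\det(\bm{I} + \sigma_n^{-2}\mathbf{K})$, an RBF kernel with unit signal variance, a hypothetical noise variance of 0.01, and uniform random points in place of the Sobol sequence mentioned in Figure 2. All function names are invented for this example.

```python
import math
import random


def rbf_kernel_matrix(points, lengthscale):
    """RBF kernel matrix with unit signal variance and one shared lengthscale."""
    n = len(points)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            K[i][j] = K[j][i] = math.exp(-0.5 * sq_dist / lengthscale ** 2)
    return K


def log_det_spd(A):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    log_det = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][i] = math.sqrt(s)
                log_det += 2.0 * math.log(L[i][i])
            else:
                L[i][j] = s / L[j][j]
    return log_det


def information_gain(points, lengthscale, noise_var):
    """Information gain 0.5 * log det(I + K / noise_var) of observing the given points."""
    n = len(points)
    K = rbf_kernel_matrix(points, lengthscale)
    A = [[K[i][j] / noise_var + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return 0.5 * log_det_spd(A)


def independent_gain(n, noise_var):
    """Information gain when K is the identity, i.e. all observations are independent."""
    return 0.5 * n * math.log(1.0 + 1.0 / noise_var)


NOISE_VAR = 0.01   # hypothetical noise variance, chosen for illustration
N_POINTS = 50
rng = random.Random(0)
for dim in (2, 6, 24):
    # Uniform random points stand in for the Sobol sequence used in the paper.
    pts = [[rng.random() for _ in range(dim)] for _ in range(N_POINTS)]
    ig = information_gain(pts, 0.5, NOISE_VAR)
    frac = ig / independent_gain(N_POINTS, NOISE_VAR)
    print(f"D={dim:2d}  IG={ig:7.2f}  fraction of independent-kernel IG={frac:.3f}")
```

By Hadamard's inequality the independent kernel upper-bounds the information gain for a fixed signal variance, so the printed fraction lies in $(0, 1]$. A fraction close to 1 means the observations are nearly uncorrelated under the model, which is the regime the Figure 2 caption describes for $D=24$ with a fixed lengthscale of 0.5.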

Theorems & Definitions (2)

Proposition 4.1: Lower Bound on EI Correlation
Proposition 3.1
