We Still Don't Understand High-Dimensional Bayesian Optimization

Colin Doumont, Donney Fan, Natalie Maus, Jacob R. Gardner, Henry Moss, Geoff Pleiss

TL;DR

The paper challenges the prevailing belief that high-dimensional Bayesian optimization (HDBO) requires complex surrogates with carefully tuned structural assumptions. It shows that a Bayesian linear regression surrogate, paired with a spherical input mapping and a decoupled, dimension-aware lengthscale, matches or exceeds state-of-the-art performance for dimensionalities $D$ from 60 to over 6,000, both when the evaluation budget $N$ is comparable to $D$ and when it is much larger. Because the linear model permits exact Thompson sampling and posterior inference that scales to many observations, the approach also has practical advantages on large-$N$ problems such as molecular optimization in latent spaces. These results call for rethinking assumptions about model complexity in HDBO and point to geometric considerations as a key driver of optimization performance.

Abstract

High-dimensional spaces have challenged Bayesian optimization (BO). Existing methods aim to overcome this so-called curse of dimensionality by carefully encoding structural assumptions, from locality to sparsity to smoothness, into the optimization procedure. Surprisingly, we demonstrate that these approaches are outperformed by arguably the simplest method imaginable: Bayesian linear regression. After applying a geometric transformation to avoid boundary-seeking behavior, Gaussian processes with linear kernels match state-of-the-art performance on tasks with 60- to 6,000-dimensional search spaces. Linear models offer numerous advantages over their non-parametric counterparts: they afford closed-form sampling and their computation scales linearly with data, a fact we exploit on molecular optimization tasks with > 20,000 observations. Coupled with empirical analyses, our results suggest the need to depart from past intuitions about BO methods in high-dimensional spaces.
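
The closed-form sampling and linear-in-data cost mentioned in the abstract can be illustrated with a minimal sketch of Bayesian linear regression followed by one exact Thompson sample. This is not the authors' implementation: the isotropic Gaussian prior, the fixed noise variance, the function names `blr_posterior` and `thompson_sample`, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def blr_posterior(X, y, prior_var=1.0, noise_var=0.01):
    """Posterior N(mean, cov) over weights w for y = X @ w + noise.

    Assumes w ~ N(0, prior_var * I) and i.i.d. Gaussian noise with
    variance noise_var (illustrative hyperparameters, not from the paper).
    """
    D = X.shape[1]
    # Building the D x D precision matrix costs O(N * D^2): linear in N.
    precision = X.T @ X / noise_var + np.eye(D) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ (X.T @ y) / noise_var
    return mean, cov

def thompson_sample(mean, cov, rng):
    """Draw one exact posterior sample of the weights, i.e. a whole linear function."""
    chol = np.linalg.cholesky(cov)
    return mean + chol @ rng.standard_normal(mean.shape[0])

# Toy data: a noisy linear objective on [-1, 1]^D (assumed for the demo).
rng = np.random.default_rng(0)
D, N = 5, 200
w_true = rng.standard_normal(D)
X = rng.uniform(-1.0, 1.0, size=(N, D))
y = X @ w_true + 0.1 * rng.standard_normal(N)

mean, cov = blr_posterior(X, y, prior_var=1.0, noise_var=0.01)
w_sample = thompson_sample(mean, cov, rng)

# A sampled linear function w_sample @ x over the box [-1, 1]^D is maximized
# at the corner sign(w_sample), which illustrates boundary-seeking behavior.
x_next = np.sign(w_sample)
```

Because the sample is a full weight vector rather than function values on a grid, the Thompson-sampling step needs no discretization of the search space. On raw inputs, however, the maximizer always lands on a corner of the hypercube.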

Paper Structure

This paper contains 55 sections, 2 theorems, 23 equations, 13 figures.

Key Result

Theorem 1

For acquisition functions increasing in posterior mean and variance (e.g. expected improvement), Bayesian linear models will maximize acquisition on the boundary of the search space. That is, for $\mathbf{x}_{t+1} = \arg \max_{\mathbf{x} \in [-1, 1]^D} \alpha_t(\mathbf{x})$ at any timestep $t$, we h
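
One special case makes the boundary claim easy to check by hand. The sketch below assumes an upper-confidence-bound acquisition under a Gaussian weight posterior $\mathbf{w} \sim \mathcal{N}(\boldsymbol{\mu}_t, \boldsymbol{\Sigma}_t)$. The symbols $\boldsymbol{\mu}_t$, $\boldsymbol{\Sigma}_t$, and $\beta$ are introduced here for illustration, and the argument is not necessarily the one used in the paper's proof.

$$\alpha_t(\mathbf{x}) = \boldsymbol{\mu}_t^\top \mathbf{x} + \beta \sqrt{\mathbf{x}^\top \boldsymbol{\Sigma}_t \mathbf{x}}, \qquad \beta \ge 0.$$

The first term is linear in $\mathbf{x}$. The second is a nonnegative multiple of a seminorm, because $\boldsymbol{\Sigma}_t$ is positive semidefinite, and is therefore convex. Their sum is convex, and a convex function on the hypercube $[-1, 1]^D$ attains its maximum at a vertex, so a maximizer of $\alpha_t$ can always be found at a corner of the search space.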

Figures (13)

  • Figure 1: Our linear kernel on spherically-mapped inputs matches state-of-the-art high-dimensional BO performance. (Left): benchmarks with $N \approx D$ evaluation budgets. While standard linear kernels can fail to make any optimization progress, our modified kernel matches or exceeds competitive methods. (Right): benchmarks with $N \gg D$. The natural scalability of linear kernels, coupled with the improved optimization performance afforded by our spherical mapping, yields new state-of-the-art results on large-$N$ tasks.
  • Figure 2: Extension of our spherical linear kernel to order-$m$ polynomial kernels. Higher-order polynomials do not improve upon our linear model ($m=1$).
  • Figure 3: Ablation over the spherical mapping function used by our modified linear kernel. While most spherical mappings improve upon the unmodified inputs (red line), the inverse stereographic projection (orange line) outperforms all other mappings.
  • Figure 4: Differences in the acquired inputs of standard versus modified linear models. (Left): the "boundary %" depicts, for each acquisition $\mathbf x_t$, the percentage of the vector's dimensions that lie on the boundary of $\mathcal{X}$ (i.e. $\pm 1$). Standard linear models lead to acquisitions that lie nearly at the corners of the hypercube (i.e. $100\%$ of dimensions on the boundary). Our linear model acquires non-corner points ($\approx 75\%$ boundary on MOPTA08) and interior points ($\approx 0\%$ boundary on SVM). (Right): length of the shortest path connecting all acquired observations (OTSD). The corner-searching behavior of the standard linear model leads to acquisitions that are spread out over $\mathcal{X}$, whereas our modified linear model yields locality that is comparable to RBF-based models.
  • Figure 5: Spherical mappings affect BO performance, but not supervised regression performance. (Left): predictive RMSE of GPs trained and tested on quasi-random data. The standard and modified linear models make worse predictions than the RBF model. (Right): predictive RMSE on adaptively-chosen BO acquisitions. Unlike on random data, linear models match RBF models at predicting future BO acquisitions from prior ones.
  • ...and 8 more figures
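
The spherical mapping discussed in the captions above can be sketched with the textbook inverse stereographic projection, which sends a point in $\mathbb{R}^D$ to the unit sphere in $\mathbb{R}^{D+1}$, followed by a plain linear kernel on the mapped points. The exact parameterization used in the paper is not given in this summary, so the `lengthscale` argument, the choice `lengthscale = sqrt(D)`, and the function names are assumptions for illustration only.

```python
import numpy as np

def inverse_stereographic(X, lengthscale=1.0):
    """Map rows of X in R^D onto the unit sphere embedded in R^(D+1).

    `lengthscale` is a hypothetical pre-scaling of the inputs; how the paper
    sets its dimension-aware lengthscale is not specified here.
    """
    Z = np.atleast_2d(X) / lengthscale
    sq_norm = np.sum(Z**2, axis=1, keepdims=True)
    # Standard formula: z -> (2 z, |z|^2 - 1) / (|z|^2 + 1).
    return np.hstack([2.0 * Z, sq_norm - 1.0]) / (sq_norm + 1.0)

def spherical_linear_kernel(X1, X2, lengthscale=1.0):
    """Linear (dot-product) kernel evaluated on spherically mapped inputs."""
    P1 = inverse_stereographic(X1, lengthscale)
    P2 = inverse_stereographic(X2, lengthscale)
    return P1 @ P2.T

# Toy inputs in [-1, 1]^D; sqrt(D) scaling is an illustrative assumption that
# keeps the scaled inputs inside the unit ball before projection.
rng = np.random.default_rng(0)
D = 60
X = rng.uniform(-1.0, 1.0, size=(4, D))
P = inverse_stereographic(X, lengthscale=np.sqrt(D))
K = spherical_linear_kernel(X, X, lengthscale=np.sqrt(D))
```

Every mapped point has unit norm, so the kernel's diagonal is constant. Prior variance therefore no longer grows toward the corners of the hypercube, which is one plausible reading of why the mapping reduces boundary-seeking.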

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 1
  • proof