Unexpected Improvements to Expected Improvement for Bayesian Optimization

Sebastian Ament, Samuel Daulton, David Eriksson, Maximilian Balandat, Eytan Bakshy

TL;DR

This work identifies a fundamental numerical issue in improvement-based Bayesian optimization, where acquisition function values and gradients vanish in large portions of the domain, hindering gradient-based optimization. To address this, the authors introduce LogEI, a family of acquisition functions whose optima align with the canonical counterparts but which are numerically stable; they extend the approach to constrained and multi-objective settings (LogCEI, LogEHVI) and to parallel batch variants (qLogEI, qLogEHVI) using log-space formulations and fat-tailed smooth approximations. Empirically, LogEI variants consistently outperform their EI counterparts across single-objective, constrained, high-dimensional, parallel, and multi-objective benchmarks, often approaching or surpassing state-of-the-art baselines with no added computational burden. The results underscore the importance of numerically robust acquisition optimization, showing joint batch optimization can compete with sequential greedy strategies and that these numerical reforms can meaningfully expand the applicability of Bayesian optimization in practice.

Abstract

Expected Improvement (EI) is arguably the most popular acquisition function in Bayesian optimization and has found countless successful applications, but its performance is often exceeded by that of more recent methods. Notably, EI and its variants, including those for the parallel and multi-objective settings, are challenging to optimize because their acquisition values vanish numerically in many regions. This difficulty generally increases as the number of observations, the dimensionality of the search space, or the number of constraints grows, resulting in performance that is inconsistent across the literature and most often sub-optimal. Herein, we propose LogEI, a new family of acquisition functions whose members have optima that are identical or approximately equal to those of their canonical counterparts, but are substantially easier to optimize numerically. We demonstrate that numerical pathologies manifest themselves in "classic" analytic EI, Expected Hypervolume Improvement (EHVI), as well as their constrained, noisy, and parallel variants, and propose corresponding reformulations that remedy these pathologies. Our empirical results show that members of the LogEI family of acquisition functions substantially improve on the optimization performance of their canonical counterparts and, surprisingly, are on par with or exceed the performance of recent state-of-the-art acquisition functions, highlighting the understated role of numerical optimization in the literature.
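
For orientation, the analytic quantities the abstract refers to can be written out as follows. This is a sketch in the standard notation for a Gaussian posterior with mean $\mu_n$ and standard deviation $\sigma_n$, assuming a maximization problem with incumbent value $y^*$; the symbols $h$, $\phi$ (standard normal density), and $\Phi$ (standard normal CDF) are introduced here for illustration and do not appear elsewhere on this page.

$$
\mathrm{EI}_{y^*}(\mathbf x) = \mathbb{E}\bigl[\max(f(\mathbf x) - y^*, 0)\bigr] = \sigma_n(\mathbf x)\, h\!\left(\frac{\mu_n(\mathbf x) - y^*}{\sigma_n(\mathbf x)}\right), \qquad h(z) = \phi(z) + z\,\Phi(z),
$$

$$
\mathrm{LogEI}_{y^*}(\mathbf x) = \log h\!\left(\frac{\mu_n(\mathbf x) - y^*}{\sigma_n(\mathbf x)}\right) + \log \sigma_n(\mathbf x).
$$

Because the logarithm is strictly increasing, the analytic EI and LogEI share the same maximizers. The practical difference lies in how $\log h(z)$ is evaluated when $h(z)$ itself is too small to represent in floating point.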

Paper Structure (61 sections, 9 theorems, 50 equations, 25 figures, 1 table)

This paper contains 61 sections, 9 theorems, 50 equations, 25 figures, 1 table.

Key Result

Theorem 1

Suppose $f$ is drawn from a Gaussian process prior $P_f$, $y^* \leq f^*$, $\mu_n, \sigma_n$ are the mean and standard deviation of the posterior $P_f(f | \mathcal{D}_n)$ and $B \in \mathbb{R}$. Then with probability $1 - \delta$, where $\epsilon_n = (f^* - y^*) + \bigl(\sqrt{-2 \log(2\delta)} - B\bigr) \max_{\mathbf x} \sigma_n({\mathbf x})$.

Figures (25)

  • Figure 1: Left: Fraction of points sampled from the domain for which the magnitude of the gradient of EI vanishes to $<\!10^{-10}$ as a function of the number of randomly generated data points $n$ for different dimensions $d$ on the Ackley function. As $n$ increases, EI and its gradients become numerically zero across most of the domain; see the appendix for details. Right: Values of EI and LogEI on a quadratic objective. EI takes on extremely small values on points for which the likelihood of improving over the incumbent is small and is numerically exactly zero in double precision for a large part of the domain ($\approx [5, 13.5]$). The left plot shows that this tends to worsen as the dimensionality of the problem and the number of data points grow, rendering gradient-based optimization of EI futile.
  • Figure 2: Regret and EI acquisition value for the candidates selected by maximizing EI and LogEI on the convex Sum-of-Squares problem. Optimization stalls out for EI after about 75 observations due to vanishing gradients (indicated by the jagged behavior of the acquisition value), while LogEI continues to make steady progress.
  • Figure 3: Best objective value as a function of iterations on the moderately and severely non-convex Michalewicz and Ackley problems for varying numbers of input dimensions. LogEI substantially outperforms both EI and GIBBON, and this gap widens as the problem dimensionality increases. JES performs slightly better than LogEI on Ackley, but for some reason fails on Michalewicz. Notably, JES is almost two orders of magnitude slower than the other acquisition functions (see the appendix).
  • Figure 4: Best feasible objective value as a function of number of function evaluations (iterations) on four engineering design problems with black-box constraints after an initial $2d$ pseudo-random evaluations.
  • Figure 5: Best objective value for parallel BO as a function of the number of evaluations for single-objective optimization on the 16-dimensional Ackley function with varying batch sizes $q$. Notably, joint optimization of the batch outperforms sequential greedy optimization.
  • ...and 20 more figures
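
The underflow described in the Figure 1 caption can be reproduced with a few lines of standard-library Python. The sketch below is illustrative only and is not the authors' implementation: the function names (`h`, `log_h`, `ei`, `log_ei`), the switch point `threshold=-10.0`, and the use of a truncated asymptotic series for the far tail are all assumptions made here for the sake of a short, self-contained example.

```python
import math

LOG_SQRT_2PI = 0.5 * math.log(2.0 * math.pi)


def h(z):
    """Naive h(z) = phi(z) + z * Phi(z) in double precision."""
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # normal pdf
    Phi = 0.5 * math.erfc(-z / math.sqrt(2.0))               # normal cdf
    return phi + z * Phi


def log_h(z, threshold=-10.0):
    """log h(z) that stays finite far in the left tail.

    Assumption: for z <= threshold we swap in the asymptotic expansion
    h(z) ~ phi(z) / z^2 * (1 - 3/z^2 + 15/z^4 - 105/z^6),
    evaluated entirely in log space. `threshold` is a hypothetical choice.
    """
    if z > threshold:
        return math.log(h(z))
    inv_z2 = 1.0 / (z * z)
    # Correction terms: -3/z^2 + 15/z^4 - 105/z^6
    series = inv_z2 * (-3.0 + inv_z2 * (15.0 - 105.0 * inv_z2))
    return -0.5 * z * z - LOG_SQRT_2PI - 2.0 * math.log(-z) + math.log1p(series)


def ei(mu, sigma, best):
    """Analytic EI for maximization with incumbent value `best`."""
    return sigma * h((mu - best) / sigma)


def log_ei(mu, sigma, best):
    """Log of analytic EI, computed without forming EI itself."""
    return log_h((mu - best) / sigma) + math.log(sigma)


# A point 40 posterior standard deviations below the incumbent:
print(ei(0.0, 0.1, 4.0))      # underflows to 0.0 in double precision
print(log_ei(0.0, 0.1, 4.0))  # large negative, but finite
```

At such a point EI carries no usable value or gradient signal, whereas the log-space value still distinguishes "hopeless" from "slightly less hopeless" candidates, which is what a gradient-based optimizer needs in order to make progress.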

Theorems & Definitions (16)

  • Theorem 1
  • Lemma 1
  • Lemma 2
  • proof
  • Lemma 3: Asymptotic Expansion
  • proof
  • Lemma 4: Monotonicity and Convexity
  • proof
  • Lemma 5
  • proof
  • ...and 6 more