Batch Acquisition Function Evaluations and Decouple Optimizer Updates for Faster Bayesian Optimization

Kaichi Irie, Shuhei Watanabe, Masaki Onishi

TL;DR

The paper identifies a bottleneck in Bayesian optimization: batching acquisition-function optimization across multiple restarts by optimizing their summed objective (C-BE) introduces off-diagonal artifacts in the approximated inverse Hessian, which slow convergence. It proposes Decoupled Batch Evaluations (D-BE), which uses a coroutine to decouple per-restart quasi-Newton updates from batched evaluations, preserving per-restart curvature while still exploiting hardware throughput. In theory, D-BE converges identically to sequential MSO while taking substantially less wall-clock time, and it outperforms C-BE. It is demonstrated on multiple benchmark functions, with speedups of up to 1.5x. The approach has been merged into GPSampler in Optuna, where it reduces the sampler's computational overhead.

Abstract

Bayesian optimization (BO) efficiently finds high-performing parameters by maximizing an acquisition function, which models the promise of parameters. A major computational bottleneck arises in acquisition function optimization, where multi-start optimization (MSO) with quasi-Newton (QN) methods is required due to the non-convexity of the acquisition function. BoTorch, a widely used BO library, currently optimizes the summed acquisition function over multiple points, leading to the speedup of MSO owing to PyTorch batching. Nevertheless, this paper empirically demonstrates the suboptimality of this approach in terms of off-diagonal approximation errors in the inverse Hessian of a QN method, slowing down its convergence. To address this problem, we propose to decouple QN updates using a coroutine while batching the acquisition function calls. Our approach not only yields the theoretically identical convergence to the sequential MSO but also drastically reduces the wall-clock time compared to the previous approaches. Our approach is available in GPSampler in Optuna, effectively reducing its computational overhead.
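The decoupling idea in the abstract can be sketched with plain Python generators. This is a minimal illustration under stated assumptions, not the Optuna or BoTorch implementation: a hand-written BFGS with Armijo backtracking stands in for L-BFGS-B, bound constraints are ignored, the Rosenbrock function stands in for the acquisition function, and the names `rosenbrock_batch`, `bfgs_restart`, and `run_decoupled` are hypothetical. Each restart owns its own inverse-Hessian estimate and yields the point it wants evaluated; the driver gathers all pending points and answers them with a single batched call.

```python
import numpy as np


def rosenbrock_batch(X):
    """Batched Rosenbrock: X is (B, D); returns values (B,) and gradients (B, D)."""
    a, b = X[:, :-1], X[:, 1:]
    r = b - a**2
    F = np.sum(100.0 * r**2 + (1.0 - a) ** 2, axis=1)
    G = np.zeros_like(X)
    G[:, :-1] += -400.0 * a * r - 2.0 * (1.0 - a)
    G[:, 1:] += 200.0 * r
    return F, G


def bfgs_restart(x0, max_iter=200, gtol=1e-6):
    """One restart as a generator: yields a query point, is sent back (f, grad)."""
    x = np.asarray(x0, dtype=float)
    eye = np.eye(x.size)
    H = eye.copy()  # inverse-Hessian estimate owned by this restart only
    f, g = yield x
    for _ in range(max_iter):
        if np.linalg.norm(g) < gtol:
            break
        p = -H @ g
        t = 1.0
        while True:  # Armijo backtracking; every trial point is one yield
            x_new = x + t * p
            f_new, g_new = yield x_new
            if f_new <= f + 1e-4 * t * (g @ p):
                break
            t *= 0.5
            if t < 1e-12:
                return x, f  # line search failed; keep the last accepted point
        s, y = x_new - x, g_new - g
        if s @ y > 1e-12:  # skip the update when the curvature condition fails
            rho = 1.0 / (s @ y)
            V = eye - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)
        x, f, g = x_new, f_new, g_new
    return x, f


def run_decoupled(starts, batched_fn):
    """Advance all restarts in lockstep, answering pending queries with one batched call."""
    workers = [bfgs_restart(x0) for x0 in starts]
    pending = {i: next(w) for i, w in enumerate(workers)}  # restart index -> query point
    results = [None] * len(workers)
    n_batched_calls = 0
    while pending:
        idx = list(pending)
        F, G = batched_fn(np.stack([pending[i] for i in idx]))
        n_batched_calls += 1
        for k, i in enumerate(idx):
            try:
                pending[i] = workers[i].send((F[k], G[k]))
            except StopIteration as stop:  # this restart finished; the batch shrinks
                results[i] = stop.value
                del pending[i]
    return results, n_batched_calls


rng = np.random.default_rng(0)
starts = rng.uniform(0.0, 3.0, size=(3, 5))  # B = 3 restarts, D = 5
results, n_calls = run_decoupled(starts, rosenbrock_batch)
```

Because no restart ever sees another restart's steps or gradients, each one follows the same trajectory it would follow if run alone, and the number of batched calls equals the evaluation count of the slowest restart. Generators are used here only because the toy optimizer is written from scratch; wrapping an existing optimizer that expects an ordinary callback would need a different coroutine mechanism.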

Paper Structure

This paper contains 13 sections, 2 equations, 5 figures, 2 tables, 2 algorithms.

Figures (5)

  • Figure 1: Contour maps of the inverse Hessian (Left) and the inverse Hessians approximated by L-BFGS-B with Seq. Opt. (Center) and C-BE (Right) on the Rosenbrock function, evaluated near the constrained minimizer ($B = 3, D = 5, \mathbf{x} \in [0, 3]^D$). Each figure has $15 \times 15$ tiles, and the $(i, j)$-th tile shows the color for the $(i,j)$-th element of the (approximated) inverse Hessian. Blue and yellow represent lower and higher values, respectively. Each subtitle reports $e_{\mathrm{rel}}(H)=\|H-H_{\mathrm{true}}\|_{F}/\|H_{\mathrm{true}}\|_{F}$. Left: The true inverse Hessian is zero in the off-diagonal blocks. Center: The inverse Hessian approximated by Seq. Opt.; its off-diagonal blocks are zero everywhere. Right: The inverse Hessian approximated by C-BE; its off-diagonal blocks are dense because C-BE gives the QN method no way to know that the off-diagonal blocks are zero.
  • Figure 2: Convergence speed of C-BE when using L-BFGS-B with memory size $m=10$ on the Rosenbrock function ($D=5, \mathbf{x} \in [0,3]^D$). The figure shows the objective mean over $B$ restarts at each iteration. Each optimization is repeated $1000 / B$ times. Each solid line and lightly shaded band represent the median and the $\pm$ IQR of the objective mean over the $1000 / B$ runs, respectively. Seq. Opt. corresponds to $B=1$. As the number of restarts $B$ increases, C-BE requires substantially more iterations to converge.
  • Figure 3: Contour maps of the inverse Hessian (Left) and its approximations by Seq. Opt. (Center) and C-BE (Right). This figure follows the same setup as Figure 1 except that BFGS, i.e., the memory is not limited, is used instead of L-BFGS-B.
  • Figure 4: Contour maps of the inverse Hessian (Left) and its approximations by Seq. Opt. (Center) and C-BE (Right). This figure follows the same setup as Figure 1 except that BFGS, i.e., the memory is not limited, is used instead of L-BFGS-B, and $B=10$ is used instead of $B=3$. Off-diagonal artifacts are more prominent for a large $B$.
  • Figure 5: Convergence speed of C-BE when using BFGS. This figure follows the same setup as Figure 2 except that BFGS is used instead of L-BFGS-B. As the number of restarts $B$ increases, the convergence of C-BE requires substantially more iterations.
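
The block structure described in the captions above can be made explicit with a short derivation. The notation is assumed for illustration and is not taken from the paper: $f$ is the acquisition function, $\mathbf{x}_b \in \mathbb{R}^D$ is the point of restart $b$, and $F$ is the summed objective that C-BE hands to a single QN optimizer. The last line is the standard BFGS inverse-Hessian update, applied to stacked step and gradient-difference vectors $\mathbf{s}_k, \mathbf{y}_k \in \mathbb{R}^{BD}$.

```latex
% Assumed notation: f = acquisition function, x_b = restart b, F = summed objective (C-BE).
\begin{align*}
F(\mathbf{x}_1,\dots,\mathbf{x}_B) &= \sum_{b=1}^{B} f(\mathbf{x}_b), \\
\frac{\partial^2 F}{\partial \mathbf{x}_b\,\partial \mathbf{x}_{b'}} &= \mathbf{0}
  \quad (b \neq b'), \\
\nabla^2 F &= \operatorname{blockdiag}\bigl(\nabla^2 f(\mathbf{x}_1),\dots,\nabla^2 f(\mathbf{x}_B)\bigr), \\
H_{k+1} &= \bigl(I-\rho_k \mathbf{s}_k \mathbf{y}_k^\top\bigr)\,H_k\,\bigl(I-\rho_k \mathbf{y}_k \mathbf{s}_k^\top\bigr)
  + \rho_k \mathbf{s}_k \mathbf{s}_k^\top,
  \qquad \rho_k = \frac{1}{\mathbf{y}_k^\top \mathbf{s}_k}.
\end{align*}
```

Since the true Hessian is block-diagonal, its inverse is too (when each block is invertible), which matches the Left panels. The update, however, is built from outer products of the stacked vectors: the $(b, b')$ block of $\rho_k \mathbf{s}_k \mathbf{s}_k^\top$ is $\rho_k \mathbf{s}_k^{(b)} \mathbf{s}_k^{(b')\top}$, which is generally nonzero for $b \neq b'$. This is one way to read the dense off-diagonal blocks in the C-BE panels; running a separate update per restart keeps those blocks exactly zero by construction.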