Differentiating Policies for Non-Myopic Bayesian Optimization

Darian Nwankwo, David Bindel

TL;DR

This work addresses the computational intractability of non-myopic Bayesian optimization by developing rollout acquisition functions that estimate horizon-$h$ value using trajectory-based simulations and differentiable base policies. It introduces a GP-based modeling framework with trajectory-aware notation and leverages variance-reduction techniques (QMC, CRN, and control variates) to enable gradient-based optimization of sampling policies. Empirical results on synthetic benchmarks show non-myopic rollout methods often outperform myopic approaches like EI and POI, highlighting practical gains in sample efficiency. The study also discusses limitations and directions for refining horizon selection and differentiation techniques for broader applicability.

Abstract

Bayesian optimization (BO) methods choose sample points by optimizing an acquisition function derived from a statistical model of the objective. These acquisition functions are chosen to balance sampling regions with predicted good objective values against exploring regions where the objective is uncertain. Standard acquisition functions are myopic, considering only the impact of the next sample, but non-myopic acquisition functions may be more effective. In principle, one could model the sampling by a Markov decision process, and optimally choose the next sample by maximizing an expected reward computed by dynamic programming; however, this is infeasibly expensive. More practical approaches, such as rollout, consider a parametric family of sampling policies. In this paper, we show how to efficiently estimate rollout acquisition functions and their gradients, enabling stochastic gradient-based optimization of sampling policies.
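One ingredient named in the TL;DR, common random numbers (CRN), can be illustrated on a toy problem. The sketch below is not the paper's estimator: `noisy_value` is a hypothetical stand-in for the reward of one simulated trajectory, and the gradient is approximated with central finite differences rather than by differentiating through the trajectory. It only shows why reusing the same random draws at nearby inputs tends to make a Monte Carlo gradient estimate far less noisy.

```python
import numpy as np

def noisy_value(x, z):
    # Hypothetical per-sample reward: a smooth function of x perturbed by the draw z.
    return -(x - 1.0) ** 2 + np.sin(3.0 * x) * z

def mc_estimate(x, z):
    # Monte Carlo average of the reward over the draws in z.
    return noisy_value(x, z).mean()

def fd_gradient(x, rng, n=64, eps=1e-2, common=True):
    # Central finite difference of the Monte Carlo estimate.
    # With common=True the same draws are reused at x + eps and x - eps (CRN);
    # with common=False each evaluation gets its own independent draws.
    z_plus = rng.standard_normal(n)
    z_minus = z_plus if common else rng.standard_normal(n)
    return (mc_estimate(x + eps, z_plus) - mc_estimate(x - eps, z_minus)) / (2 * eps)

rng = np.random.default_rng(0)
crn = np.array([fd_gradient(0.5, rng, common=True) for _ in range(200)])
ind = np.array([fd_gradient(0.5, rng, common=False) for _ in range(200)])

print("spread with CRN:        ", crn.std())
print("spread with independent:", ind.std())
```

In this toy setting the noise shared by the two evaluations largely cancels in the difference, so the CRN estimates cluster tightly around the true derivative of the noise-free part at `x = 0.5`, while the independent-draw estimates are dominated by sampling noise divided by the small step `eps`.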

Paper Structure (20 sections, 48 equations, 4 figures, 1 table)


Figures (4)

  • Figure 1: The first graph shows the rollout acquisition function for a non-trivial horizon. We then depict the estimated gradient and the corresponding standard errors.
  • Figure 2: The first graph shows the cost of increasing the horizon as iterations progress. We then depict the multiplicative cost of looking further ahead relative to the rollout acquisition function with $h=1$. Note that $h=1$ corresponds to a two-step lookahead strategy.
  • Figure 3: The average GAP and standard error across a suite of 15 synthetic test functions, with 60 randomized trials and a single random observation used to initialize the underlying GP model. All non-myopic strategies outperform POI, and most outperform EI.
  • Figure 4: A histogram depicting the average winning strategies.