Nonparametric Bayesian Optimization for General Rewards

Zishi Zhang; Tao Ren; Yijie Peng

TL;DR

The paper tackles Bayesian optimization (BO) when the reward distribution is uncertain and potentially nonstandard. It introduces the infinite Gaussian process (∞-GP), a Bayesian nonparametric surrogate that places a prior over reward distributions via a sequential spatial Dirichlet process mixture, yielding an infinite mixture of GP surfaces. Combined with Thompson Sampling, the ∞-GP achieves a no-regret guarantee under mild conditions (a Lipschitz objective and general noise tails) and outperforms classical GP surrogates in heavy-tailed and non-stationary settings. Computationally, a truncated Gibbs sampler keeps the method scalable: the number of realized surfaces grows only logarithmically, which makes the approach practical for large-scale BO tasks and real-world applications.

Abstract

This work focuses on Bayesian optimization (BO) under reward model uncertainty. We propose the first BO algorithm that achieves a no-regret guarantee in a general reward setting, requiring only Lipschitz continuity of the objective function and accommodating a broad class of measurement noise. The core of our approach is a novel surrogate model, termed the infinite Gaussian process ($\infty$-GP). It is a Bayesian nonparametric model that places a prior on the space of reward distributions, enabling it to represent a substantially broader class of reward models than the classical Gaussian process (GP). The $\infty$-GP is used in combination with Thompson Sampling (TS) to enable effective exploration and exploitation. Correspondingly, we develop a new TS regret analysis framework for general rewards, which relates the regret to the total variation distance between the surrogate model and the true reward distribution. Furthermore, with a truncated Gibbs sampling procedure, our method is computationally scalable, incurring minimal additional memory and computational cost compared to the classical GP. Empirical results demonstrate state-of-the-art performance, particularly in settings with non-stationary, heavy-tailed, or other ill-conditioned rewards.
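For orientation, the classical baseline that the abstract contrasts against can be sketched as Thompson Sampling with a single GP surrogate on a discretized domain. This is not the $\infty$-GP method: the RBF kernel, the toy `objective`, the noise level, and every function name below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two 1-D point sets (assumed kernel)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_grid, noise_var=1e-2):
    """Posterior mean and covariance of a zero-mean GP on x_grid."""
    K = rbf_kernel(x_obs, x_obs) + noise_var * np.eye(len(x_obs))
    K_s = rbf_kernel(x_obs, x_grid)
    K_ss = rbf_kernel(x_grid, x_grid)
    solved = np.linalg.solve(K, K_s)          # K^{-1} K_s
    mean = solved.T @ y_obs
    cov = K_ss - K_s.T @ solved
    return mean, cov

def thompson_step(x_obs, y_obs, x_grid, rng):
    """Draw one posterior sample path and return its maximizer on the grid."""
    mean, cov = gp_posterior(x_obs, y_obs, x_grid)
    cov = cov + 1e-8 * np.eye(len(x_grid))    # jitter for numerical stability
    path = rng.multivariate_normal(mean, cov)
    return x_grid[np.argmax(path)]

def objective(x):
    """Hypothetical toy objective with its maximum at x = 0.7."""
    return -(x - 0.7) ** 2

rng = np.random.default_rng(0)
x_grid = np.linspace(0.0, 1.0, 50)
x_obs = np.array([0.1, 0.9])
y_obs = objective(x_obs) + 0.1 * rng.normal(size=2)

for _ in range(20):
    x_next = thompson_step(x_obs, y_obs, x_grid, rng)
    y_next = objective(x_next) + 0.1 * rng.normal()
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, y_next)

print("queried points:", np.round(x_obs, 2))
```

The Gaussian noise assumed here is exactly the kind of restriction the abstract aims to relax; under heavy-tailed or multimodal rewards, a single-GP posterior of this form would be misspecified.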

Paper Structure (43 sections, 11 theorems, 112 equations, 6 figures, 3 tables, 2 algorithms)

Introduction
Related Works
Problem Formulation and Background
Limitations of Classical BO Methods and GP Surrogate Model
Model Uncertainty vs. Value Uncertainty
$\infty$-Gaussian-Process Surrogate Modeling
Predictive Distribution and Thompson Sampling
Predictive Distribution
$\infty$-GP Thompson Sampling
Theoretical Analysis: Understanding the Superiority of $\infty$-GP over GP
Broad Convergence of $\infty$-GP Model
Regret Analysis for General Reward Distributions
A Central Lemma.
A Discretization-then-Decoupling Proof Technique
Step 1: Decision Discretization.
...and 28 more sections

Key Result

Proposition 1

The predictive reward distribution for any unexplored solution $x_{n+1}\in\mathcal{X}$ is given by an expression whose term A, $[y(x_{n+1}) \mid \Theta^{(1)}, \xi^{(z_{n+1})}(x_{n+1})]$, follows $\mathcal{N}(x_{n+1}^\top \beta + \xi^{(z_{n+1})}(x_{n+1}), \tau^2)$.
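As a minimal illustration of term A alone, the sketch below draws rewards from $\mathcal{N}(x^\top \beta + \xi, \tau^2)$ for fixed values of $\beta$, $\xi$, and $\tau$. The function name `sample_term_a` and all numerical values are hypothetical; sampling the surface label $z_{n+1}$ and the surface value $\xi^{(z_{n+1})}(x_{n+1})$ is not shown.

```python
import numpy as np

def sample_term_a(x, beta, xi_value, tau, rng, size=1):
    """Draw y ~ N(x^T beta + xi_value, tau^2) for a fixed surface value."""
    mean = float(np.dot(x, beta)) + xi_value
    return rng.normal(loc=mean, scale=tau, size=size)

rng = np.random.default_rng(0)
x_new = np.array([0.5, -1.0])   # hypothetical unexplored solution
beta = np.array([2.0, 1.0])     # hypothetical linear coefficients
xi_value = 0.3                  # hypothetical surface value at x_new
tau = 0.1                       # hypothetical noise standard deviation

samples = sample_term_a(x_new, beta, xi_value, tau, rng, size=10000)
print("empirical mean:", samples.mean(), "empirical std:", samples.std())
```

With these values the conditional mean is $x^\top\beta + \xi = 0.3$, so the empirical mean and standard deviation should sit close to 0.3 and 0.1.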

Figures (6)

Figure 1: (a)–(b): Draw a stochastic process from the distribution space, representing model uncertainty. (b)–(c): Draw a realization from the stochastic process, representing value uncertainty. The $\infty$-GP model covers (a)–(c), whereas the GP model covers only (b)–(c).
Figure 2: $\infty$-GP is theoretically an infinite sum of GP surfaces. Solid dots indicate that $\xi(x_i)$ is realized on that surface; hollow dots represent unrealized latent variables. Red dots mark the prediction of a new response at an unexplored $x_{new}$. Red dots on solid lines represent new observations realized on an existing surface, while dashed lines indicate that the observation lies on a new surface drawn from the base distribution.
Figure 3: Illustration of the modeling-capacity limitation of GP surrogates. Even with infinite data, the best Gaussian approximation cannot match a $t$-distribution or a bimodal distribution, leaving a persistent TVD gap (blue shaded region).
Figure 4: Cumulative regret comparison across BO algorithms
Figure 5: Weights of each surface over the course of optimization on Ackley-NS.
...and 1 more figure
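The persistent TVD gap described in the Figure 3 caption can be checked numerically. The sketch below is a rough illustration under stated assumptions: it uses a Student $t$ density with 3 degrees of freedom (the caption does not specify which $t$-distribution), restricts the search to zero-mean Gaussians, and approximates the integral with a Riemann sum on a truncated grid.

```python
import math
import numpy as np

def student_t_pdf(x, nu):
    """Density of the Student t distribution with nu degrees of freedom."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1.0 + x ** 2 / nu) ** (-(nu + 1) / 2)

def gaussian_pdf(x, sigma):
    """Density of a zero-mean Gaussian with standard deviation sigma."""
    return np.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def tvd(p, q, dx):
    """Total variation distance 0.5 * integral |p - q| via a Riemann sum."""
    return 0.5 * float(np.sum(np.abs(p - q)) * dx)

x = np.linspace(-60.0, 60.0, 120001)   # truncated integration grid
dx = x[1] - x[0]
p_t = student_t_pdf(x, nu=3)           # assumed heavy-tailed target

# Grid search over the Gaussian scale for the smallest TVD.
sigmas = np.linspace(0.5, 3.0, 126)
tvds = np.array([tvd(p_t, gaussian_pdf(x, s), dx) for s in sigmas])
best_sigma = float(sigmas[np.argmin(tvds)])
best_tvd = float(tvds.min())

print("best sigma:", best_sigma, "remaining TVD:", best_tvd)
```

The minimum stays strictly above zero for every scale in the search range, which is the gap the caption refers to: no amount of data removes a mismatch that comes from the Gaussian family itself.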

Theorems & Definitions (17)

Definition 1: RKHS Assumption
Proposition 1: Predictive Distribution
Theorem 1: Convergence of the $\infty$-GP Model
Remark 1: Mildness of the Assumption
Lemma 1: Regret Bound via TVD
Remark 2
Lemma 2: Regret Bound via Sampling Decoupling
Theorem 2
Remark 3: Mildness of Assumption
Proposition 2: Slow Growth in the Number of Realized Surfaces
...and 7 more
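To give intuition for Proposition 2 (slow growth in the number of realized surfaces), the sketch below simulates a standard Chinese restaurant process, the sequential view of a plain Dirichlet process. This is a simplification and an assumption on our part: the paper's model is a sequential spatial Dirichlet process mixture, and the concentration parameter `alpha` and function names here are hypothetical.

```python
import numpy as np

def crp_num_clusters(n, alpha, rng):
    """Simulate n draws from a Chinese restaurant process; return cluster count."""
    counts = []
    for i in range(n):
        # Existing cluster k is chosen with prob counts[k] / (alpha + i),
        # a new cluster is opened with prob alpha / (alpha + i).
        probs = np.array(counts + [alpha], dtype=float) / (alpha + i)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
    return len(counts)

def expected_num_clusters(n, alpha):
    """Exact expectation: sum over i = 0..n-1 of alpha / (alpha + i)."""
    i = np.arange(n)
    return float(np.sum(alpha / (alpha + i)))

rng = np.random.default_rng(0)
n, alpha = 500, 1.0
simulated = [crp_num_clusters(n, alpha, rng) for _ in range(40)]
mean_simulated = float(np.mean(simulated))
expected = expected_num_clusters(n, alpha)

print("simulated mean:", mean_simulated, "exact expectation:", expected)
```

For `alpha = 1` the exact expectation is the harmonic number $H_n \approx \ln n + 0.577$, so the number of occupied clusters grows logarithmically in the number of observations in this simplified setting.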
