Bayesian Analysis of Combinatorial Gaussian Process Bandits

Jack Sandberg, Niklas Åkerblom, Morteza Haghir Chehreghani

TL;DR

This work studies Bayesian learning in the combinatorial volatile Gaussian process semi-bandit, where each round the agent selects a subset (super arm) from a potentially infinite set of base arms and observes additive semi-bandit rewards. It analyzes GP-UCB, GP-BayesUCB, and GP-TS in this setting and gives the first regret bound for GP-BayesUCB. The results cover finite arm sets and, via discretization, infinite arm sets, and are stated in terms of the maximal information gain $\gamma_T$ and the kernel-driven information structure. The authors prove sublinear Bayesian regret bounds of the form $\mathrm{BR}(T)=\mathcal{O}(\sqrt{\lambda^*_K T K \beta_T \gamma_{TK}})$ under suitable conditions. They apply the framework to online energy-efficient navigation on real road networks, where TS-based methods achieve lower regret and explore more efficiently than UCB variants and Bayesian baselines. The practical impact lies in scalable, context-aware planning under uncertainty for energy-aware routing and related combinatorial decision problems.

Abstract

We consider the combinatorial volatile Gaussian process (GP) semi-bandit problem. Each round, an agent is provided a set of available base arms and must select a subset of them to maximize the long-term cumulative reward. We study the Bayesian setting and provide novel Bayesian cumulative regret bounds for three GP-based algorithms: GP-UCB, GP-BayesUCB and GP-TS. Our bounds extend previous results for GP-UCB and GP-TS to the infinite, volatile and combinatorial setting, and to the best of our knowledge, we provide the first regret bound for GP-BayesUCB. Volatile arms encompass other widely considered bandit problems such as contextual bandits. Furthermore, we employ our framework to address the challenging real-world problem of online energy-efficient navigation, where we demonstrate its effectiveness compared to the alternatives.
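The round structure described above (a changing set of available base arms, a chosen subset, additive rewards) can be illustrated with a minimal sketch of a UCB-style selection step. It assumes that GP posterior means and standard deviations for the base arms have already been computed, and that any subset of exactly `k` available arms is feasible; both are simplifications made for illustration, not the setup of the paper. The function name, arm labels, and numbers are hypothetical.

```python
import math

def select_super_arm(available, mean, std, beta, k):
    """Pick the k available base arms with the largest UCB index.

    Because rewards are additive, the index of a super arm is the sum of the
    indices of its base arms, so taking the top k maximizes the index over
    all size-k subsets (an assumed, simplified feasible set).
    """
    index = {a: mean[a] + math.sqrt(beta) * std[a] for a in available}
    return sorted(available, key=lambda a: index[a], reverse=True)[:k]

# Hypothetical posterior statistics for four base arms.
mean = {"e1": 0.2, "e2": 0.5, "e3": 0.4, "e4": 0.9}
std = {"e1": 0.6, "e2": 0.1, "e3": 0.3, "e4": 0.2}

# Volatility: arm "e4" is not offered in this round, so it cannot be chosen
# even though its posterior mean is the highest.
available = ["e1", "e2", "e3"]
chosen = select_super_arm(available, mean, std, beta=4.0, k=2)
print(chosen)
```

With these numbers the indices are 1.4, 0.7 and 1.0 for `e1`, `e2` and `e3`, so the uncertain arm `e1` is selected ahead of `e2` despite its lower mean, which is the exploration effect the confidence parameter controls.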

Paper Structure (31 sections, 20 theorems, 59 equations, 12 figures, 2 tables, 3 algorithms)

Key Result

lemma 1

Let $C_\omega = \left( \sqrt{\pi} \omega / \sqrt{2e (\omega - 1)} \right)^{1/\omega}$, then for GP-BUCB with confidence parameter $\beta_t = 2 \left( \operatorname{erf}^{-1}(1 - 2 \eta_t ) \right)^2$ and $\eta_t = \frac{\sqrt{2\pi}^\omega}{2|\mathcal{A}|^\omega t^\xi}$, $\xi > 0$, …
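The confidence parameter in lemma 1 can be evaluated numerically. The sketch below uses the identity $2\left(\operatorname{erf}^{-1}(1-2\eta)\right)^2 = \left(\Phi^{-1}(1-\eta)\right)^2$, where $\Phi^{-1}$ is the standard normal quantile function, so $\beta_t$ is the squared $(1-\eta_t)$ quantile. The function names and the parameter values (`num_arms=10`, `omega=2`, `xi=1`) are illustrative assumptions; the conditions on $\omega$ are cut off in the statement above, and the code only checks that $\eta_t$ lies in $(0, 1)$.

```python
import math
from statistics import NormalDist

def eta(t, num_arms, omega, xi):
    """eta_t = sqrt(2*pi)^omega / (2 * |A|^omega * t^xi)."""
    return math.sqrt(2 * math.pi) ** omega / (2 * num_arms ** omega * t ** xi)

def beta(t, num_arms, omega, xi):
    """beta_t = 2 * erfinv(1 - 2*eta_t)^2, computed as a squared normal quantile."""
    e = eta(t, num_arms, omega, xi)
    if not 0.0 < e < 1.0:
        raise ValueError("eta_t must lie in (0, 1) for the quantile to exist")
    return NormalDist().inv_cdf(1.0 - e) ** 2

# Illustrative values only: 10 arms, omega = 2, xi = 1.
for t in (1, 10, 100):
    print(t, eta(t, 10, 2.0, 1.0), beta(t, 10, 2.0, 1.0))
```

Since $\eta_t$ shrinks as $t$ grows, the quantile level $1-\eta_t$ approaches 1 and $\beta_t$ increases over time.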

Figures (12)

  • Figure 1: Road networks of Luxembourg (left) and Monaco (right) with evaluation routes A and B highlighted in blue and red.
  • Figure 2: Cumulative regret for UCB, BUCB and TS using GP and Bayesian inference (BI) methods. The lines and regions correspond to the mean and $\pm 1$ standard error.
  • Figure 3: Cumulative regret of GP-UCB and GP-BUCB (left and middle column) for different parametrizations of $\beta_t$ (right column).
  • Figure 4: Cumulative regret of GP-BUCB, BI-BUCB, GP-TS and BI-TS for varying prior lengthscale values $\ell$.
  • Figure 5: Cumulative regret at $t = 500$ for varying prior lengthscale values. Errorbars correspond to $\pm 1$ standard error.
  • ...and 7 more figures

Theorems & Definitions (38)

  • lemma 1
  • theorem 3.1: Finite regret bounds
  • lemma 2
  • theorem 3.2: Infinite regret bounds
  • lemma 3
  • proof
  • lemma 4
  • proof
  • lemma 4
  • proof
  • ...and 28 more