Bayesian Analysis of Combinatorial Gaussian Process Bandits
Jack Sandberg, Niklas Åkerblom, Morteza Haghir Chehreghani
TL;DR
This work studies Bayesian learning in the combinatorial volatile Gaussian process (GP) semi-bandit. Each round, the agent selects a subset (super arm) of the currently available base arms, which are drawn from a potentially infinite set. It receives a reward that is additive over the chosen base arms and observes each chosen base arm's reward (semi-bandit feedback). The paper analyzes GP-UCB, GP-TS, and GP-BayesUCB in this setting and gives what is, to the authors' knowledge, the first regret bound for GP-BayesUCB. The results cover both finite and infinite base-arm sets, the latter via discretization, and are expressed through the maximal information gain $\gamma_T$ of the kernel. The authors prove sublinear Bayesian regret bounds of the form $\mathrm{BR}(T)=\mathcal{O}(\sqrt{\lambda^*_K T K \beta_T \gamma_{TK}})$ under suitable conditions. They also apply the framework to online energy-efficient navigation on real road networks, where TS-based methods achieve lower regret and explore more efficiently than the UCB variants and Bayesian baselines. The practical impact lies in scalable, context-aware planning under uncertainty for energy-aware routing and related combinatorial decision problems.
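For reference, the quantity being bounded can be written as follows. This is a common form of Bayesian regret for additive combinatorial semi-bandits, assumed here rather than quoted from the paper. In it, $f$ is drawn from the GP prior, $\mathcal{S}_t$ is the set of feasible super arms built from the base arms available in round $t$, and the expectation is over $f$, the observation noise, and any internal randomization of the algorithm:

$$
\mathrm{BR}(T) = \mathbb{E}\left[\sum_{t=1}^{T}\left(\sum_{a \in S_t^*} f(a) - \sum_{a \in S_t} f(a)\right)\right],
\qquad S_t^* \in \arg\max_{S \in \mathcal{S}_t} \sum_{a \in S} f(a).
$$

Each round reveals up to $K$ noisy base-arm rewards, so $T$ rounds yield at most $TK$ observations. This is presumably why the bound involves $\gamma_{TK}$ rather than $\gamma_T$.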
Abstract
We consider the combinatorial volatile Gaussian process (GP) semi-bandit problem. Each round, an agent is given a set of available base arms and must select a subset of them to maximize the long-term cumulative reward. We study the Bayesian setting and provide novel Bayesian cumulative regret bounds for three GP-based algorithms: GP-UCB, GP-BayesUCB and GP-TS. Our bounds extend previous results for GP-UCB and GP-TS to the infinite, volatile and combinatorial setting, and to the best of our knowledge, we provide the first regret bound for GP-BayesUCB. Volatile arms encompass other widely considered bandit problems such as contextual bandits. Furthermore, we employ our framework to address the challenging real-world problem of online energy-efficient navigation, where we demonstrate its effectiveness compared to alternative methods.
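The three algorithms named above share the same posterior update and differ mainly in how they score each available base arm before a super arm is chosen. The sketch below is a minimal, hypothetical illustration in pure Python, not the authors' implementation. It assumes the following:

- scalar base-arm features on a grid in [0, 1];
- an RBF kernel with fixed hyperparameters and a zero-mean GP prior;
- a fixed exploration weight `beta` for GP-UCB, where theory uses a schedule $\beta_t$;
- a simplified quantile level 1 - 1/(t+1) for GP-BayesUCB;
- a top-K oracle in which any K available arms form a feasible super arm. The paper's navigation application instead selects paths.

All function names (`posterior`, `indices`, `run`) are invented for this example.

```python
import math
import random
from statistics import NormalDist


def rbf(x, y, lengthscale=0.2):
    """Squared-exponential kernel on scalar base-arm features (assumed kernel)."""
    return math.exp(-((x - y) ** 2) / (2.0 * lengthscale ** 2))


def cholesky(A):
    """Lower-triangular L with L L^T = A; floors pivots to avoid sqrt of tiny negatives."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][i] = math.sqrt(max(A[i][i] - s, 1e-12))
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def forward(L, b):
    """Solve L x = b for lower-triangular L."""
    x = []
    for i, bi in enumerate(b):
        x.append((bi - sum(L[i][k] * x[k] for k in range(i))) / L[i][i])
    return x


def backward(L, b):
    """Solve L^T x = b for lower-triangular L."""
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (b[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def posterior(X, y, Q, noise=0.1):
    """Zero-mean GP posterior mean vector and covariance matrix at query arms Q."""
    prior = [[rbf(a, b) for b in Q] for a in Q]
    if not X:
        return [0.0] * len(Q), prior
    K = [[rbf(a, b) + (noise ** 2 if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    L = cholesky(K)
    alpha = backward(L, forward(L, y))  # alpha = (K + noise^2 I)^{-1} y
    V = [forward(L, [rbf(q, a) for a in X]) for q in Q]  # V[i] = L^{-1} k(X, q_i)
    mean = [sum(rbf(q, a) * w for a, w in zip(X, alpha)) for q in Q]
    n = len(Q)
    cov = [[prior[i][j] - sum(u * v for u, v in zip(V[i], V[j])) for j in range(n)]
           for i in range(n)]
    return mean, cov


def indices(rule, mean, cov, t, rng, beta=4.0):
    """Per-base-arm scores for round t under one of the three rules (illustrative)."""
    sd = [math.sqrt(max(cov[i][i], 0.0)) for i in range(len(mean))]
    if rule == "ucb":  # GP-UCB: mean + sqrt(beta) * sd, beta held fixed here
        return [m + math.sqrt(beta) * s for m, s in zip(mean, sd)]
    if rule == "bayesucb":  # GP-BayesUCB: posterior quantile at level 1 - 1/(t+1)
        z = NormalDist().inv_cdf(1.0 - 1.0 / (t + 1))
        return [m + z * s for m, s in zip(mean, sd)]
    if rule == "ts":  # GP-TS: one joint posterior sample of f over the available arms
        n = len(mean)
        L = cholesky([[cov[i][j] + (1e-8 if i == j else 0.0) for j in range(n)]
                      for i in range(n)])
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        return [mean[i] + sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
    raise ValueError(f"unknown rule: {rule}")


def run(rule, f, T=30, K=3, seed=0):
    """Volatile top-K semi-bandit loop; returns cumulative regret against the true f."""
    rng = random.Random(seed)
    grid = [i / 19 for i in range(20)]
    X, y, regret = [], [], 0.0
    for t in range(1, T + 1):
        avail = rng.sample(grid, 8)  # volatile: a fresh set of available base arms
        mean, cov = posterior(X, y, avail)
        score = indices(rule, mean, cov, t, rng)
        chosen = sorted(range(len(avail)), key=lambda i: -score[i])[:K]
        best = sum(sorted((f(a) for a in avail), reverse=True)[:K])
        regret += best - sum(f(avail[i]) for i in chosen)
        for i in chosen:  # semi-bandit feedback: one noisy reward per chosen base arm
            X.append(avail[i])
            y.append(f(avail[i]) + rng.gauss(0.0, 0.1))
    return regret
```

For instance, `run("ts", lambda x: math.sin(6 * x))` returns the realized cumulative regret over 30 rounds for one fixed, hypothetical reward function. Averaging over functions drawn from the GP prior would estimate the Bayesian regret defined above. Swapping `"ts"` for `"ucb"` or `"bayesucb"` changes only the scoring rule. GP-TS here samples jointly from the posterior covariance rather than independently per arm, so the sampled scores respect the correlations the kernel induces between nearby arms.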
