
Thompson Sampling For Combinatorial Bandits: Polynomial Regret and Mismatched Sampling Paradox

Raymond Zhang, Richard Combes

TL;DR

This work studies linear combinatorial bandits with subgaussian rewards under semi-bandit feedback and introduces BG-CTS, a Boosted Gaussian Thompson Sampling algorithm. BG-CTS combines a Gaussian posterior with a time-varying exploration boost to achieve a finite-time regret that scales polynomially with the problem size, avoiding the exponential dependence on the action size $m$ that affects prior TS analyses. The analysis is built on clean-run events that bound the transient regret, and it yields a regret bound of the form $O\left(\frac{\sigma^2 d \ln m}{\Delta_{\min}} \ln T + \frac{\sigma^2 d^2 m \ln m}{\Delta_{\min}} \ln\ln T\right)$ plus a polynomial term. The paper also demonstrates the mismatched sampling paradox: TS with a mismatched Gaussian posterior can outperform TS with the natural, correctly specified posterior, and in some settings the gap is exponential. Experiments show that BG-CTS substantially outperforms Beta-based TS and ESCB in moderate-to-large $m$ regimes, which points to practical finite-time gains and gives guidance on posterior design for bandit exploration.

Abstract

We consider Thompson Sampling (TS) for linear combinatorial semi-bandits with subgaussian rewards. We propose the first known TS whose finite-time regret does not scale exponentially with the dimension of the problem. We further show the "mismatched sampling paradox": a learner who knows the reward distributions and samples from the correct posterior distribution can perform exponentially worse than a learner who does not know the reward distributions and simply samples from a well-chosen Gaussian posterior. The code used to generate the experiments is available at https://github.com/RaymZhang/CTS-Mismatched-Paradox
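
To make the idea of "a Gaussian posterior with a time-varying exploration boost" concrete, here is a minimal Python sketch. It is not the authors' implementation (that lives in the repository linked above). It assumes `d` base arms, actions that are subsets of exactly `m` arms (so the linear oracle is a top-`m` selection), Gaussian reward noise, and a purely hypothetical boost schedule `hypothetical_boost`; the paper's actual schedule is not reproduced here.

```python
import math
import numpy as np


def hypothetical_boost(t: int, m: int) -> float:
    """Illustrative variance inflation factor (an assumption, NOT the paper's schedule).

    It is >= 1, grows with the action size m, and decays slowly to 1 as t grows.
    """
    log_t = math.log(t + math.e)  # >= 1 for t >= 0
    return 1.0 + m * math.log(log_t) / log_t


def top_m_oracle(scores: np.ndarray, m: int) -> np.ndarray:
    """Linear maximization over m-subsets: pick the m largest scores."""
    return np.argsort(scores)[-m:]


def run_boosted_gaussian_ts(means, m, sigma, horizon, seed=0):
    """Gaussian Thompson Sampling with a boosted variance under semi-bandit feedback.

    Returns the cumulative (pseudo-)regret and the per-arm play counts.
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    d = len(means)
    counts = np.zeros(d)  # number of observations per base arm
    sums = np.zeros(d)    # sum of observed rewards per base arm
    best_value = np.sort(means)[-m:].sum()
    regret = 0.0

    for t in range(1, horizon + 1):
        n = np.maximum(counts, 1.0)  # avoid division by zero for unplayed arms
        mu_hat = sums / n
        std = sigma * np.sqrt(hypothetical_boost(t, m) / n)
        theta = rng.normal(mu_hat, std)  # one Gaussian sample per base arm
        theta[counts == 0] = np.inf      # force each arm to be tried at least once

        action = top_m_oracle(theta, m)

        # Semi-bandit feedback: a noisy reward is observed for every chosen arm.
        rewards = means[action] + sigma * rng.standard_normal(m)
        counts[action] += 1
        sums[action] += rewards
        regret += best_value - means[action].sum()

    return regret, counts
```

A call such as `run_boosted_gaussian_ts([0.9, 0.8, 0.2, 0.1, 0.1, 0.1], m=2, sigma=0.5, horizon=2000)` returns the accumulated regret and how often each arm was played. Replacing `hypothetical_boost` with the constant `1.0` gives plain Gaussian TS, which is a simple way to see the effect of the boost in this toy setting.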

Paper Structure

This paper contains 30 sections, 10 theorems, 105 equations, 1 figure, 2 algorithms.

Key Result

Theorem 1

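For $\lambda = 1$ and $\sigma^2$-subgaussian rewards, the regret $R(T)$ of BG-CTS is upper bounded by:

$$R(T) \le C \frac{\sigma^2 d \ln m}{\Delta_{\min}} \ln T + C' \frac{\sigma^2 d^2 m \ln m}{\Delta_{\min}} \ln\ln T + P\left(m, d, \frac{1}{\Delta_{\min}}, \Delta_{\max}, \sigma\right)$$

with $C, C'$ universal constants and $P$ a polynomial in $m, d, \frac{1}{\Delta_{\min}}, \Delta_{\max}, \sigma$.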
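
As a reading aid, the leading-order behavior can be worked out directly, taking the bound in the form quoted in the TL;DR (a $\ln T$ term, a $\ln\ln T$ term, and a polynomial term $P$ that does not depend on $T$). Dividing by $\ln T$ and letting $T \to \infty$:

$$\frac{1}{\ln T}\left( C \frac{\sigma^2 d \ln m}{\Delta_{\min}} \ln T + C' \frac{\sigma^2 d^2 m \ln m}{\Delta_{\min}} \ln\ln T + P \right) = C \frac{\sigma^2 d \ln m}{\Delta_{\min}} + C' \frac{\sigma^2 d^2 m \ln m}{\Delta_{\min}} \cdot \frac{\ln\ln T}{\ln T} + \frac{P}{\ln T} \;\xrightarrow[T \to \infty]{}\; C \frac{\sigma^2 d \ln m}{\Delta_{\min}}$$

So only the first term survives asymptotically. The ratio of the second term to the first is $\frac{C'}{C} \, d \, m \, \frac{\ln\ln T}{\ln T}$, which means the $\ln\ln T$ term can still dominate at moderate horizons when $d m$ is large.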

Theorems & Definitions (24)

  • Theorem 1
  • Proposition 2
  • Proposition 3
  • Lemma 4
  • proof
  • Lemma 5
  • proof
  • Lemma 6: Multiplicative Azuma Chernoff
  • proof
  • Lemma 7
  • ...and 14 more