Feel-Good Thompson Sampling for Contextual Bandits: a Markov Chain Monte Carlo Showdown

Emile Anand, Sarah Liaw

TL;DR

The paper tackles exploration in high-dimensional contextual bandits by evaluating Feel-Good Thompson Sampling (FG-TS) and its smoothed variant (SFG-TS) when posteriors are approximated by MCMC methods. FG-TS adds an optimism bonus to the likelihood to boost exploration, while SFG-TS smooths this bonus to enable tractable online sampling; both are benchmarked across linear, logistic, and neural settings with diverse MCMC algorithms (LMC, MALA, HMC, ULMC) and enhancements. Across fourteen datasets, FG-TS and SFG-TS reduce regret relative to vanilla TS in linear and logistic settings with exact posteriors, but neural bandits show limited gains and can even degrade with large bonuses under approximation error. The paper concludes that FG-TS variants are competitive and useful baselines for modern contextual-bandit benchmarks, while highlighting the critical role of posterior quality and hyperparameter tuning in online MCMC-based exploration.
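To make the difference between the two bonuses concrete, the sketch below compares a capped bonus $\min(b, f^*)$ with a softplus-smoothed version, where $f^*$ stands for the best predicted reward over arms. This is only an illustration under assumptions: the function names, the cap `b`, the sharpness `s`, and the specific softplus smoothing are choices made here, and the paper's exact parameterization may differ.

```python
import numpy as np

def hard_bonus(f_star, b):
    """Capped optimism bonus min(b, f*); non-differentiable at f* = b."""
    return np.minimum(b, f_star)

def smooth_bonus(f_star, b, s):
    """Smoothed bonus b - softplus(s * (b - f*)) / s (assumed form).

    Uses min(b, f) = b - max(0, b - f) and replaces max(0, u) with
    softplus; np.logaddexp(0, u) evaluates log(1 + exp(u)) stably.
    """
    return b - np.logaddexp(0.0, s * (b - f_star)) / s

f = np.linspace(-2.0, 3.0, 11)   # candidate values of the best predicted reward
b = 1.0                          # hypothetical cap on the bonus
gap = np.max(np.abs(smooth_bonus(f, b, s=50.0) - hard_bonus(f, b)))
print(f"largest gap between hard and smooth bonus: {gap:.4f}")
```

The gap is largest at $f^* = b$, where it equals $\log(2)/s$, so a larger `s` tracks the hard bonus more closely at the cost of sharper gradients for the sampler.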

Abstract

Thompson Sampling (TS) is widely used to address the exploration/exploitation tradeoff in contextual bandits, yet recent theory shows that it does not explore aggressively enough in high-dimensional problems. Feel-Good Thompson Sampling (FG-TS) addresses this by adding an optimism bonus that biases toward high-reward models, and it achieves the asymptotically minimax-optimal regret in the linear setting when posteriors are exact. However, its performance with approximate posteriors, which are common in large-scale or neural problems, has not been benchmarked. We provide the first systematic study of FG-TS and its smoothed variant (SFG-TS) across eleven real-world and synthetic benchmarks. To evaluate their robustness, we compare settings with exact posteriors (linear and logistic bandits) against approximate regimes produced by fast but coarse stochastic-gradient samplers. Ablations over preconditioning, bonus scale, and prior strength reveal a trade-off: larger bonuses help when posterior samples are accurate, but hurt when sampling noise dominates. FG-TS generally outperforms vanilla TS in linear and logistic bandits, but tends to be weaker in neural bandits. Nevertheless, because FG-TS and its variants are competitive and easy to use, we recommend them as baselines in modern contextual-bandit benchmarks. Finally, we provide source code for all our experiments at https://github.com/SarahLiaw/ctx-bandits-mcmc-showdown.
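As a rough illustration of how an optimism bonus enters an approximate-posterior sampler, the sketch below runs FG-TS with unadjusted Langevin Monte Carlo (LMC) on a small synthetic linear bandit. It is not the authors' implementation. The loss form $\sum_s [\eta (x_s^\top\theta - r_s)^2 - \lambda \min(b, \max_a c_{s,a}^\top\theta)] + \|\theta\|^2 / (2\sigma^2)$, all hyperparameter values, the step-size rule, and the names `fg_grad` and `lmc_sample` are assumptions chosen for this example.

```python
import numpy as np

def fg_grad(theta, X, r, C, eta, lam, b, prior_var):
    """(Sub)gradient of the assumed feel-good loss described above."""
    grad = 2.0 * eta * X.T @ (X @ theta - r) + theta / prior_var
    scores = C @ theta                       # (t, K) predicted reward of each arm
    best = scores.argmax(axis=1)             # greedy arm in each past round
    below_cap = scores.max(axis=1) < b       # bonus is flat once the cap b is hit
    x_best = C[np.arange(len(best)), best]   # (t, d) features of the greedy arms
    return grad - lam * (x_best * below_cap[:, None]).sum(axis=0)

def lmc_sample(theta, grad_fn, step, n_steps, rng):
    """Unadjusted Langevin steps targeting a density proportional to exp(-loss)."""
    for _ in range(n_steps):
        noise = rng.standard_normal(theta.shape)
        theta = theta - step * grad_fn(theta) + np.sqrt(2.0 * step) * noise
    return theta

rng = np.random.default_rng(0)
d, K, T = 5, 4, 200                          # hypothetical problem size
eta, lam, b, prior_var, noise_sd = 1.0, 0.1, 1.0, 1.0, 0.1
theta_true = rng.standard_normal(d)
theta_true /= np.linalg.norm(theta_true)

X, r, C = np.zeros((T, d)), np.zeros(T), np.zeros((T, K, d))
regret = np.zeros(T)
theta = np.sqrt(prior_var) * rng.standard_normal(d)   # round 0: draw from the prior
for t in range(T):
    ctx = rng.standard_normal((K, d))
    ctx /= np.linalg.norm(ctx, axis=1, keepdims=True)  # unit-norm arm features
    if t > 0:
        # Step size shrinks with the curvature bound 1/prior_var + 2*eta*t.
        step = 0.5 / (1.0 / prior_var + 2.0 * eta * t)
        grad_fn = lambda th: fg_grad(th, X[:t], r[:t], C[:t], eta, lam, b, prior_var)
        theta = lmc_sample(theta, grad_fn, step, 20, rng)  # warm start from last sample
    arm = int(np.argmax(ctx @ theta))        # act greedily on the sampled model
    means = ctx @ theta_true
    X[t], C[t] = ctx[arm], ctx
    r[t] = means[arm] + noise_sd * rng.standard_normal()
    regret[t] = means.max() - means[arm]

cum_regret = regret.cumsum()
print(f"cumulative regret after {T} rounds: {cum_regret[-1]:.2f}")
```

Setting `lam = 0.0` recovers a vanilla TS baseline with the same sampler, which is one simple way to probe the bonus-scale trade-off described in the abstract.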

Paper Structure

This paper contains 21 sections, 7 theorems, 30 equations, 1 figure, 6 tables, 4 algorithms.

Key Result

Lemma C.1

If $\mathbf{V}_t \in \mathbb{R}^{d\times d}$ is an invertible symmetric matrix and $\mathcal{L}_t:\mathbb{R}^d\to\mathbb{R}$ is a smooth function, then the preconditioned FGLMCTS dynamics defined by the corresponding stochastic differential equation converge to a unique stationary distribution $\pi(\mathrm d\theta)\propto e^{-\beta_t \mathcal{L}_t(\theta)}\,\mathrm d\theta$.
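The lemma's stochastic differential equation is not reproduced here, so the numerical check below assumes the standard preconditioned Langevin form $\mathrm d\theta = -\mathbf{V}^{-1}\nabla\mathcal{L}(\theta)\,\mathrm dt + \sqrt{2/\beta}\,\mathbf{V}^{-1/2}\,\mathrm dB_t$ with $\mathbf{V}$ symmetric positive definite, which may differ from the paper's exact statement. For a quadratic loss $\mathcal{L}(\theta) = \tfrac12\theta^\top A\theta$ the claimed stationary law is Gaussian with covariance $(\beta A)^{-1}$, which an Euler-Maruyama simulation should approximately reproduce. The matrices `A` and `V`, the step size, and the function name are hypothetical choices.

```python
import numpy as np

def preconditioned_lmc(grad_fn, V, beta, step, n_steps, n_chains, rng):
    """Euler-Maruyama discretization of the assumed preconditioned SDE.

    Chains are stored as rows of theta, so V^{-1} grad becomes grad @ V_inv
    (V is symmetric) and the noise S xi becomes xi @ S.T.
    """
    d = V.shape[0]
    V_inv = np.linalg.inv(V)
    S = np.linalg.cholesky(V_inv)            # S @ S.T == V^{-1}
    theta = np.zeros((n_chains, d))
    for _ in range(n_steps):
        xi = rng.standard_normal((n_chains, d))
        theta = (theta
                 - step * grad_fn(theta) @ V_inv
                 + np.sqrt(2.0 * step / beta) * xi @ S.T)
    return theta

rng = np.random.default_rng(0)
A = np.array([[4.0, 1.0], [1.0, 0.5]])       # Hessian of the quadratic loss (SPD)
V = np.diag(np.diag(A))                      # diagonal preconditioner
beta = 2.0

samples = preconditioned_lmc(lambda th: th @ A, V, beta,
                             step=0.02, n_steps=1000, n_chains=10000, rng=rng)
emp_cov = np.cov(samples, rowvar=False)
target_cov = np.linalg.inv(beta * A)
rel_err = np.linalg.norm(emp_cov - target_cov) / np.linalg.norm(target_cov)
print(f"relative covariance error: {rel_err:.3f}")
```

The residual error combines Monte Carlo noise with the $O(\text{step})$ bias of the unadjusted discretization; the preconditioner changes how fast the chains mix, not the distribution they target.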

Figures (1)

  • Figure 1: Mean regret comparison on simulated bandit problems. The shaded band around each mean curve represents $\pm 1$ sample standard deviation across 5 independent runs.

Theorems & Definitions (13)

  • Lemma C.1: Convergence of Preconditioned Langevin Dynamics
  • proof
  • Theorem C.3: Theorem 5 in [huix2023tight]
  • Theorem C.4
  • proof
  • Remark C.4.1
  • Theorem C.5: Morse-Sard Theorem [sard1942measure]
  • Lemma C.6
  • proof
  • Lemma C.7
  • ...and 3 more