Diffusion Approximations for Thompson Sampling in the Small Gap Regime

Lin Fan, Peter W. Glynn

TL;DR

This work analyzes Thompson sampling in the small-gap regime, where arm gaps are $O(\sqrt{\gamma})$ and the horizon is $O(1/\gamma)$, and proves diffusion-limit descriptions as $\gamma \downarrow 0$. By expressing the dynamics through scaled processes and applying the Continuous Mapping Theorem, the authors derive SDE and stochastic-ODE limits for Gaussian TS and extend these to exponential-family and bootstrap variants, establishing an invariance principle that many TS-like algorithms share the same weak limit. They also show robustness to model mis-specification in this regime and discuss batched updates, yielding practical diffusion-based insights into the distribution of regret. The results rely on Bernstein–von Mises-type posterior approximations, Knight-time changes, and a suite of weak-convergence tools to connect discrete TS dynamics to continuous stochastic processes. Overall, the paper provides a unified, principled diffusion framework for understanding and approximating the early-stage, minimax-like behavior of sampling-based bandits.
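The scaling in this regime can be made concrete with a short back-of-the-envelope calculation. The symbols $\Delta$ (a fixed gap constant), $\sigma$ (per-pull reward standard deviation), and $t$ (rescaled time) are introduced here for illustration only and are not notation taken from the paper. Suppose an arm is pulled $n = t/\gamma$ times and its mean differs from another arm's by $\Delta\sqrt{\gamma}$. Then the accumulated mean difference (signal) and the standard deviation of the summed rewards (noise) are

$$
\underbrace{\Delta\sqrt{\gamma}\cdot\frac{t}{\gamma}}_{\text{signal}} = \frac{\Delta\, t}{\sqrt{\gamma}},
\qquad
\underbrace{\sigma\sqrt{\frac{t}{\gamma}}}_{\text{noise}} = \frac{\sigma\sqrt{t}}{\sqrt{\gamma}} .
$$

Both are of order $\gamma^{-1/2}$, so after multiplying by $\sqrt{\gamma}$ they stay of order one as $\gamma \downarrow 0$. This is a heuristic for why a diffusion (drift plus Brownian noise) is the natural candidate limit, not a substitute for the paper's argument.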

Abstract

We study the process-level dynamics of Thompson sampling in the "small gap" regime. The small gap regime is one in which the gaps between the arm means are of order $\sqrt{\gamma}$ or smaller and the time horizon is of order $1/\gamma$, where $\gamma$ is small. As $\gamma \downarrow 0$, we show that the process-level dynamics of Thompson sampling converge weakly to the solutions to certain stochastic differential equations and stochastic ordinary differential equations. Our weak convergence theory is developed from first principles using the Continuous Mapping Theorem, can handle stationary, weakly dependent reward processes, and can also be adapted to analyze a variety of sampling-based bandit algorithms. Indeed, we show that the process-level dynamics of many sampling-based bandit algorithms (including Thompson sampling designed for any single-parameter exponential family of rewards, as well as non-parametric bandit algorithms based on bootstrap re-sampling) satisfy an invariance principle. Namely, their weak limits coincide with that of Gaussian parametric Thompson sampling with Gaussian priors. Moreover, in the small gap regime, the regret performance of these algorithms is generally insensitive to model mis-specification, changing continuously with increasing degrees of mis-specification.
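To make the regime concrete, the following is a minimal simulation sketch of Gaussian Thompson sampling with arm means of order $\sqrt{\gamma}$ and a horizon of $T/\gamma$ steps. It is not the authors' code. The function name and the parameters `delta`, `sigma`, `T`, and `prior_var_scale` are hypothetical, and the zero prior mean, known noise variance, and prior variance equal to `prior_var_scale * gamma` are assumptions made for illustration.

```python
import numpy as np

def gaussian_ts_small_gap(gamma, delta=(0.0, 1.0), sigma=1.0, T=1.0,
                          prior_var_scale=1.0, seed=0):
    """Simulate Gaussian Thompson sampling in a small gap regime (illustrative).

    Returns (sqrt(gamma) * cumulative regret, gamma * pull counts per arm).
    """
    rng = np.random.default_rng(seed)
    # Arm means are sqrt(gamma) * delta, so gaps are of order sqrt(gamma).
    means = np.sqrt(gamma) * np.asarray(delta, dtype=float)
    K = means.size
    n_steps = int(round(T / gamma))        # horizon of order 1/gamma
    prior_var = prior_var_scale * gamma    # assumed prior variance, order gamma
    counts = np.zeros(K)                   # number of pulls per arm
    sums = np.zeros(K)                     # sum of observed rewards per arm
    regret = 0.0
    for _ in range(n_steps):
        # Conjugate update: N(0, prior_var) prior, N(mean, sigma^2) rewards.
        post_prec = 1.0 / prior_var + counts / sigma**2
        post_mean = (sums / sigma**2) / post_prec
        # Draw one posterior sample per arm and play the arm with the largest draw.
        samples = rng.normal(post_mean, np.sqrt(1.0 / post_prec))
        arm = int(np.argmax(samples))
        reward = rng.normal(means[arm], sigma)
        counts[arm] += 1
        sums[arm] += reward
        regret += means.max() - means[arm]
    return np.sqrt(gamma) * regret, gamma * counts

if __name__ == "__main__":
    for g in (1e-2, 1e-3):
        scaled_regret, scaled_counts = gaussian_ts_small_gap(g, seed=1)
        print(g, scaled_regret, scaled_counts)
```

Repeating the run over many seeds for a fixed small `gamma` gives an empirical distribution of the rescaled regret, which is the kind of quantity the diffusion limit is meant to approximate. With the default `delta` and `T`, the rescaled regret always lies between 0 and 1, since each step contributes at most `sqrt(gamma)` to the unscaled regret.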

Paper Structure

This paper contains 19 sections, 22 theorems, 138 equations.

Key Result

Theorem 1

Consider a $K$-armed bandit in the small gap regime of the paper's small gap assumption (with iid rewards for each arm) and the random table model of reward feedback. For the Gaussian Thompson sampler with prior variance scaling as $\gamma$, the rescaled process converges weakly as $\gamma \downarrow 0$ in $D^{2K}[0,\infty)$ to $(U,S)$, the unique strong solution to an SDE driven by a standard $K$-dimensional Brownian motion $B$ (the display equations of the theorem are not reproduced here).

Theorems & Definitions (32)

  • Remark 1
  • Remark 2
  • Definition 1: $\epsilon$-warm-start
  • Remark 3
  • Theorem 1
  • Definition 2
  • Theorem 2
  • Remark 4
  • Theorem 3
  • Proposition 1
  • ...and 22 more