Diffusion Approximations for Thompson Sampling in the Small Gap Regime
Lin Fan, Peter W. Glynn
TL;DR
This work analyzes Thompson sampling (TS) in the small-gap regime, where arm gaps are $O(\sqrt{\gamma})$ and the horizon is of order $1/\gamma$, and proves diffusion-limit descriptions as $\gamma \downarrow 0$. By expressing the dynamics through scaled processes and applying the Continuous Mapping Theorem, the authors derive SDE and stochastic-ODE limits for Gaussian TS and extend these to exponential-family and bootstrap variants, establishing an invariance principle: many TS-like algorithms share the same weak limit as Gaussian TS. They also show that regret is robust to model mis-specification in this regime and discuss batched updates, yielding practical diffusion-based insights into the distribution of regret. The results rely on Bernstein–von Mises-type posterior approximations, Knight-time changes, and a suite of weak-convergence tools to connect discrete TS dynamics to continuous stochastic processes. Overall, the paper provides a unified diffusion framework for understanding and approximating the early-stage, minimax-like behavior of sampling-based bandits.
Abstract
We study the process-level dynamics of Thompson sampling in the ``small gap'' regime. The small gap regime is one in which the gaps between the arm means are of order $\sqrt{\gamma}$ or smaller and the time horizon is of order $1/\gamma$, where $\gamma$ is small. As $\gamma \downarrow 0$, we show that the process-level dynamics of Thompson sampling converge weakly to the solutions to certain stochastic differential equations and stochastic ordinary differential equations. Our weak convergence theory is developed from first principles using the Continuous Mapping Theorem, can handle stationary, weakly dependent reward processes, and can also be adapted to analyze a variety of sampling-based bandit algorithms. Indeed, we show that the process-level dynamics of many sampling-based bandit algorithms -- including Thompson sampling designed for any single-parameter exponential family of rewards, as well as non-parametric bandit algorithms based on bootstrap re-sampling -- satisfy an invariance principle. Namely, their weak limits coincide with that of Gaussian parametric Thompson sampling with Gaussian priors. Moreover, in the small gap regime, the regret performance of these algorithms is generally insensitive to model mis-specification, changing continuously with increasing degrees of mis-specification.
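
To make the scaling concrete, the following is a minimal simulation sketch of the small gap regime, not the paper's own construction. It assumes two arms with Gaussian rewards of known variance, a gap of `delta * sqrt(gamma)`, a horizon of `horizon / gamma` steps, and independent `N(0, prior_var)` priors. The function name `simulate_small_gap_ts` and all of its parameters are hypothetical choices made for illustration. It returns the time spent on each arm rescaled by $\gamma$, and the regret rescaled by $\sqrt{\gamma}$, which are the kinds of scaled quantities a diffusion limit describes.

```python
import numpy as np


def simulate_small_gap_ts(gamma, delta=1.0, horizon=1.0, sigma=1.0,
                          prior_var=1.0, seed=0):
    """Two-armed Gaussian Thompson sampling in the small gap regime.

    Illustrative assumptions (not taken from the paper): arm means are
    (delta * sqrt(gamma), 0), rewards are N(mean, sigma^2) with sigma known,
    and each arm mean has an independent N(0, prior_var) prior.
    """
    rng = np.random.default_rng(seed)
    n_steps = int(round(horizon / gamma))          # horizon of order 1/gamma
    means = np.array([delta * np.sqrt(gamma), 0.0])  # gap of order sqrt(gamma)
    counts = np.zeros(2)   # number of pulls per arm
    sums = np.zeros(2)     # sum of observed rewards per arm

    for _ in range(n_steps):
        # Conjugate Gaussian posterior for each arm mean.
        precision = 1.0 / prior_var + counts / sigma**2
        post_mean = (sums / sigma**2) / precision
        post_std = np.sqrt(1.0 / precision)
        # Thompson sampling: draw one posterior sample per arm, play the argmax.
        arm = int(np.argmax(rng.normal(post_mean, post_std)))
        reward = rng.normal(means[arm], sigma)
        counts[arm] += 1
        sums[arm] += reward

    scaled_time = gamma * counts
    # Regret is delta * sqrt(gamma) per pull of arm 1; multiplying by
    # sqrt(gamma) gives delta * gamma * counts[1].
    scaled_regret = delta * scaled_time[1]
    return scaled_time, scaled_regret


if __name__ == "__main__":
    for g in (1e-2, 1e-3):
        time_split, regret = simulate_small_gap_ts(g)
        print(f"gamma={g:g}  scaled time per arm={time_split}  "
              f"scaled regret={regret:.3f}")
```

Repeating the run over many seeds for a fixed small `gamma` gives an empirical distribution of the scaled regret, which is the object the weak limits are meant to approximate.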
