Optimism Stabilizes Thompson Sampling for Adaptive Inference

Shunxing Yan; Han Zhong

TL;DR

This work addresses the challenge of conducting valid statistical inference under adaptive data collection for Thompson sampling in Gaussian K-armed bandits. It identifies optimism as the unifying mechanism that stabilizes TS, showing that two principled variants—variance inflation and a mean bonus—achieve stability for any fixed number of arms $K\ge 2$ and even when multiple arms are optimal. The authors prove that stable optimistic TS yields asymptotically normal, Wald-type confidence intervals for each arm’s mean, with suboptimal arms pulled on a $\Theta(\log T)$ scale and optimal arms sharing the remainder of rounds, incurring only a mild regret cost. This brings practical inferential guarantees to adaptive experiments and A/B testing contexts where data are collected online, while maintaining competitive regret and offering versatile interpretations of the optimism mechanism.

Abstract

Thompson sampling (TS) is widely used for stochastic multi-armed bandits, yet its inferential properties under adaptive data collection are subtle. Classical asymptotic theory for sample means can fail because arm-specific sample sizes are random and coupled with the rewards through the action-selection rule. We study this phenomenon in the $K$-armed Gaussian bandit and identify \emph{optimism} as a key mechanism for restoring \emph{stability}, a sufficient condition for valid asymptotic inference requiring each arm's pull count to concentrate around a deterministic scale. First, we prove that variance-inflated TS \citep{halder2025stable} is stable for any $K \ge 2$, including the challenging regime where multiple arms are optimal. This resolves the open question raised by \citet{halder2025stable} by extending their results from the two-armed setting to the general $K$-armed setting. Second, we analyze an alternative optimistic modification that keeps the posterior variance unchanged but adds an explicit mean bonus to the posterior mean, and establish the same stability conclusion. In summary, suitably implemented optimism stabilizes Thompson sampling and enables asymptotically valid inference in multi-armed bandits, while incurring only a mild additional regret cost.
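
The two optimistic modifications described in the abstract can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: rewards are unit-variance Gaussian, the posterior for arm $a$ is taken as $\mathcal{N}(\widehat{\mu}_a, 1/N_a)$ (a flat-prior simplification), and both the constant inflation factor `strength` and the bonus form `strength / sqrt(N_a)` are hypothetical choices, not the tuning analyzed in the paper. The function name `optimistic_ts` and its parameters are likewise made up for this example.

```python
import numpy as np


def optimistic_ts(means, horizon, mode="inflate", strength=2.0, seed=0):
    """Simulate Gaussian Thompson sampling with an optional optimism tweak.

    mode="vanilla": sample theta_a ~ N(mu_hat_a, 1 / N_a).
    mode="inflate": multiply the posterior variance by `strength` (assumed form).
    mode="bonus":   shift the posterior mean by strength / sqrt(N_a) (assumed form).
    Returns the pull counts and the per-arm sample means.
    """
    if mode not in ("vanilla", "inflate", "bonus"):
        raise ValueError(f"unknown mode: {mode!r}")
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    k = means.size
    if horizon < k:
        raise ValueError("horizon must be at least the number of arms")

    counts = np.zeros(k, dtype=int)
    sums = np.zeros(k)

    # Pull every arm once so that each sample mean is defined.
    for a in range(k):
        sums[a] += rng.normal(means[a], 1.0)
        counts[a] += 1

    for _ in range(k, horizon):
        mu_hat = sums / counts
        post_sd = 1.0 / np.sqrt(counts)
        if mode == "inflate":
            # Wider posterior: same center, variance scaled by `strength`.
            theta = rng.normal(mu_hat, np.sqrt(strength) * post_sd)
        elif mode == "bonus":
            # Same posterior width, center shifted upward by a bonus.
            theta = rng.normal(mu_hat + strength * post_sd, post_sd)
        else:
            theta = rng.normal(mu_hat, post_sd)
        a = int(np.argmax(theta))
        sums[a] += rng.normal(means[a], 1.0)
        counts[a] += 1

    return counts, sums / counts
```

Calling `optimistic_ts([0.0, 0.5, 0.5], 5000, mode="bonus")` returns the pull counts and sample means for a three-armed instance with two optimal arms, which is the regime the abstract singles out as challenging. Comparing the counts across seeds for the three modes gives a rough feel for how concentrated the pull counts are.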

Paper Structure (56 sections, 21 theorems, 203 equations, 1 algorithm)

Introduction
Contributions
Related Work
Adaptive inference for bandit algorithms.
Thompson sampling and stability.
Optimism and optimistic posterior sampling.
Other related work.
Notations
Preliminaries
Multi-armed Bandit
Adaptive Inference and Stability
Thompson Sampling and Optimism
Standard Thompson sampling (vanilla TS).
Optimism for stability.
Theoretical Guarantees
...and 41 more sections

Key Result

Proposition 2.2

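If a bandit algorithm $\mathscr{A}$ is stable, then for any arm $a \in [K]$,

$$\frac{\sqrt{N_{a,T}}\,\bigl(\widehat{\mu}_{a,T} - \mu_a\bigr)}{\widehat{\sigma}_{a,T}} \xrightarrow{d} \mathcal{N}(0,1) \quad \text{as } T \to \infty,$$

where $\mu_a$ is the mean reward of arm $a$, $\widehat{\mu}_{a,T}$ is the sample mean defined in eq:def:pull:time, and $\widehat{\sigma}_{a,T}$ is the sample standard deviation of the rewards for arm $a$ up to time $T$, with the convention that $\widehat{\sigma}_{a,T}=1$ when $N_{a,T}=1$.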
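
Asymptotic normality of this kind is what justifies the Wald-type intervals mentioned in the TL;DR. The following is a minimal sketch of such an interval for one arm, assuming the studentized statistic $\sqrt{N_{a,T}}(\widehat{\mu}_{a,T}-\mu_a)/\widehat{\sigma}_{a,T}$ is approximately standard normal. The helper name `wald_interval` is hypothetical, and the use of the $n-1$ denominator in the sample standard deviation is an assumption, since the source does not spell out the formula.

```python
import math
from statistics import NormalDist


def wald_interval(rewards, level=0.95):
    """Wald-type confidence interval for one arm's mean reward.

    `rewards` holds the rewards observed for that arm. The sample standard
    deviation uses the n - 1 denominator (an assumption) and is set to 1
    when only one reward is available, following the stated convention.
    """
    n = len(rewards)
    if n == 0:
        raise ValueError("need at least one reward")
    mean = sum(rewards) / n
    if n == 1:
        sd = 1.0
    else:
        sd = math.sqrt(sum((r - mean) ** 2 for r in rewards) / (n - 1))
    # Two-sided standard normal quantile for the requested coverage level.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width
```

The interval is only asymptotically valid, and only when the data-collecting algorithm is stable; applied to data from an unstable algorithm, the same formula carries no coverage guarantee.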

Theorems & Definitions (24)

Definition 2.1: Stability \citep{lai1982least}
Proposition 2.2: Stability implies asymptotic normality \citep{lai1982least}
Theorem 4.1: Stability for TS with variance inflation
Theorem 4.2: Stability for TS with mean bonus
Remark 4.3: Suboptimal pulls and regret
Theorem 4.4: Adaptive inference
Lemma A.1: Monotonicity
Lemma A.2: Dot-product inequality
Lemma A.3: Winner perturbation
Lemma A.4: Lower tail of $\theta^\star_{t+1}$
...and 14 more
