Optimism Stabilizes Thompson Sampling for Adaptive Inference

Shunxing Yan, Han Zhong

TL;DR

This work addresses the challenge of conducting valid statistical inference under adaptive data collection for Thompson sampling (TS) in Gaussian $K$-armed bandits. It identifies optimism as the unifying mechanism that stabilizes TS, showing that two principled variants, variance inflation and a mean bonus, achieve stability for any fixed number of arms $K\ge 2$, even when multiple arms are optimal. The authors prove that stable optimistic TS yields asymptotically normal, Wald-type confidence intervals for each arm's mean: suboptimal arms are pulled on a $\Theta(\log T)$ scale and optimal arms share the remaining rounds, at only a mild regret cost. This brings practical inferential guarantees to adaptive experiments and A/B tests where data are collected online, while maintaining competitive regret and offering versatile interpretations of the optimism mechanism.

Abstract

Thompson sampling (TS) is widely used for stochastic multi-armed bandits, yet its inferential properties under adaptive data collection are subtle. Classical asymptotic theory for sample means can fail because arm-specific sample sizes are random and coupled with the rewards through the action-selection rule. We study this phenomenon in the $K$-armed Gaussian bandit and identify \emph{optimism} as a key mechanism for restoring \emph{stability}, a sufficient condition for valid asymptotic inference that requires each arm's pull count to concentrate around a deterministic scale. First, we prove that variance-inflated TS \citep{halder2025stable} is stable for any $K \ge 2$, including the challenging regime where multiple arms are optimal. This resolves the open question raised by \citet{halder2025stable} by extending their results from the two-armed setting to the general $K$-armed setting. Second, we analyze an alternative optimistic modification that keeps the posterior variance unchanged but adds an explicit mean bonus to the posterior mean, and establish the same stability conclusion. In summary, suitably implemented optimism stabilizes Thompson sampling and enables asymptotically valid inference in multi-armed bandits, while incurring only a mild additional regret cost.
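
The two optimistic modifications described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's algorithm: it assumes unit-variance Gaussian rewards, one forced pull per arm for initialisation, and a single hypothetical constant `strength` that either multiplies the posterior variance (`"inflation"`) or scales a bonus of the form `strength / sqrt(n)` added to the posterior mean (`"bonus"`). The function name, the parameter names, and these specific schedules are assumptions made for the example; the paper's actual inflation factor and bonus may differ.

```python
import math
import random


def optimistic_ts(means, horizon, variant="inflation", strength=2.0, seed=0):
    """Gaussian Thompson sampling with a simple optimistic tweak (illustrative).

    variant="inflation": sample from N(mu_hat, strength / n)  (wider posterior)
    variant="bonus":     sample from N(mu_hat + strength / sqrt(n), 1 / n)
    variant="vanilla":   sample from N(mu_hat, 1 / n)
    `strength` is a hypothetical constant, not the paper's schedule.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k   # number of pulls per arm
    sums = [0.0] * k   # running reward sum per arm

    for t in range(horizon):
        if t < k:
            arm = t  # pull each arm once to initialise the statistics
        else:
            samples = []
            for a in range(k):
                n = counts[a]
                mu_hat = sums[a] / n
                if variant == "inflation":
                    center, var = mu_hat, strength / n
                elif variant == "bonus":
                    center, var = mu_hat + strength / math.sqrt(n), 1.0 / n
                else:
                    center, var = mu_hat, 1.0 / n
                samples.append(rng.gauss(center, math.sqrt(var)))
            # play the arm with the largest posterior sample
            arm = max(range(k), key=samples.__getitem__)

        reward = rng.gauss(means[arm], 1.0)  # unit-variance Gaussian reward
        counts[arm] += 1
        sums[arm] += reward

    mu_hats = [s / n for s, n in zip(sums, counts)]
    return counts, mu_hats
```

Calling `optimistic_ts([0.0, 0.5, 0.5], horizon=5000, variant="bonus")` returns the per-arm pull counts and sample means, which is enough to inspect how pulls are split when two arms are optimal.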

Paper Structure

This paper contains 56 sections, 21 theorems, 203 equations, and 1 algorithm.

Key Result

Proposition 2.2

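If a bandit algorithm $\mathscr{A}$ is stable, then for any arm $a \in [K]$, as $T \to \infty$,

$$\frac{\sqrt{N_{a,T}}\,\bigl(\widehat{\mu}_{a,T} - \mu_a\bigr)}{\widehat{\sigma}_{a,T}} \xrightarrow{d} \mathcal{N}(0,1),$$

where $\mu_a$ is the mean reward of arm $a$, $N_{a,T}$ is the number of pulls of arm $a$ up to time $T$, $\widehat{\mu}_{a,T}$ is the sample mean of the rewards for arm $a$ up to time $T$, and $\widehat{\sigma}_{a,T}$ is the sample standard deviation of those rewards, with the convention that $\widehat{\sigma}_{a,T}=1$ when $N_{a,T}=1$.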
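
As a rough illustration of the Wald-type interval that this kind of asymptotic normality licenses, the sketch below computes a confidence interval for one arm's mean from that arm's observed rewards. The function name `wald_interval` and its `level` parameter are hypothetical; the code uses the usual sample standard deviation with divisor $N_{a,T}-1$ (an assumption, since the source does not show the formula) and the stated convention $\widehat{\sigma}_{a,T}=1$ when the arm has been pulled once. The interval is only asymptotically valid, and only when the data come from a stable algorithm.

```python
import math
from statistics import NormalDist, mean, stdev


def wald_interval(rewards, level=0.95):
    """Wald-type confidence interval for one arm's mean (illustrative sketch)."""
    n = len(rewards)
    if n == 0:
        raise ValueError("arm was never pulled")
    mu_hat = mean(rewards)
    # Convention from the proposition: sigma_hat = 1 when the arm has one pull.
    sigma_hat = stdev(rewards) if n > 1 else 1.0
    # Two-sided standard normal quantile, e.g. about 1.96 for level=0.95.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * sigma_hat / math.sqrt(n)
    return mu_hat - half_width, mu_hat + half_width
```

For example, `wald_interval([1.0, 2.0, 3.0])` is centred at 2.0, the sample mean of the three rewards.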

Theorems & Definitions (24)

  • Definition 2.1: Stability \citep{lai1982least}
  • Proposition 2.2: Stability implies asymptotic normality \citep{lai1982least}
  • Theorem 4.1: Stability for TS with variance inflation
  • Theorem 4.2: Stability for TS with mean bonus
  • Remark 4.3: Suboptimal pulls and regret
  • Theorem 4.4: Adaptive inference
  • Lemma A.1: Monotonicity
  • Lemma A.2: Dot-product inequality
  • Lemma A.3: Winner perturbation
  • Lemma A.4: Lower tail of $\theta^\star_{t+1}$
  • ...and 14 more