A Broader View of Thompson Sampling

Yanlin Qu, Hongseok Namkoong, Assaf Zeevi

TL;DR

The paper addresses why Thompson Sampling balances exploration and exploitation by recasting posterior sampling as an online optimization problem under a faithful stationarization of long-horizon regret. By replacing the standard discounted-DP objective with a stationary squared-regret ($\mathcal{R}^2$) objective, the authors derive a stationary Bellman equation and a time-invariant optimal policy, $x^*(\pi)$, that minimizes instantaneous regret augmented by a regularizer. Thompson Sampling then emerges as a special case of this online optimization, with a biserial-covariance regularizer $\tilde{\nu}(\pi)$ that quantifies remaining uncertainty about which arm is better. Through one-arm and two-arm analyses, the paper characterizes how the $\mathcal{R}^2$-optimal policy differs from TS, demonstrates a principled way to compare them, and proposes a fix (a conditional regularizer shutdown) to align TS more closely with the $\mathcal{R}^2$-optimal objective. The framework offers a principled path to understanding and improving posterior sampling methods beyond their traditional heuristics, with implications for designing uncertainty-aware bandit algorithms that optimally trade off exploration and exploitation.

Abstract

Thompson Sampling is one of the most widely used and studied bandit algorithms, known for its simple structure, low regret performance, and solid theoretical guarantees. Yet, in stark contrast to most other families of bandit algorithms, the exact mechanism through which posterior sampling (as introduced by Thompson) is able to "properly" balance exploration and exploitation remains a mystery. In this paper we show that the core insight to address this question stems from recasting Thompson Sampling as an online optimization algorithm. To distill this, a key conceptual tool is introduced, which we refer to as "faithful" stationarization of the regret formulation. Essentially, the finite horizon dynamic optimization problem is converted into a stationary counterpart which "closely resembles" the original objective (in contrast, the classical infinite horizon discounted formulation, which leads to the Gittins index, alters the problem and objective in too significant a manner). The newly crafted time invariant objective can be studied using Bellman's principle, which leads to a time invariant optimal policy. When viewed through this lens, Thompson Sampling admits a simple online optimization form that mimics the structure of the Bellman-optimal policy, and where greediness is regularized by a measure of residual uncertainty based on point-biserial correlation. This answers the question of how Thompson Sampling balances exploration-exploitation, and moreover, provides a principled framework to study and further improve Thompson's original idea.
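For readers unfamiliar with the baseline algorithm, the posterior sampling rule discussed in the abstract can be sketched for a Bernoulli bandit with independent $\mathrm{Beta}(1,1)$ priors. This is a minimal illustration of standard Beta-Bernoulli Thompson Sampling, not the paper's $\mathcal{R}^2$-optimal policy or its online optimization form. The function name `thompson_sampling_bernoulli`, its parameters, and the fixed arm means are assumptions made for the example, and the regret it reports is the pseudo-regret on one fixed instance rather than the Bayesian regret $\mathcal{R}_T(Q^{\text{TS}};\pi_0)$ averaged over the prior.

```python
import random


def thompson_sampling_bernoulli(true_means, horizon, seed=0):
    """Beta-Bernoulli Thompson Sampling on a fixed instance (illustrative sketch).

    Returns the number of pulls per arm and the cumulative pseudo-regret.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    # Beta(1, 1) prior per arm: alpha = successes + 1, beta = failures + 1.
    alpha = [1] * n_arms
    beta = [1] * n_arms
    pulls = [0] * n_arms
    best_mean = max(true_means)
    regret = 0.0
    for _ in range(horizon):
        # Draw one mean per arm from its posterior, then act greedily on the draws.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        # Observe a Bernoulli reward and update the conjugate posterior.
        reward = 1 if rng.random() < true_means[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
        regret += best_mean - true_means[arm]
    return pulls, regret


if __name__ == "__main__":
    pulls, regret = thompson_sampling_bernoulli([0.7, 0.4], horizon=2000)
    print("pulls per arm:", pulls)
    print("cumulative pseudo-regret:", round(regret, 2))
```

Exploration here comes only from the randomness of the posterior draws: as the posteriors concentrate, the draws for the worse arm rarely exceed those for the better arm, so it is pulled less and less often.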

Paper Structure

This paper contains 15 sections, 8 theorems, 56 equations, 7 figures, 2 algorithms.

Key Result

Proposition 1

If there exists a finite constant $\sigma>0$ such that the posterior predictive distribution of the reward ($R_{t+1}\sim P_{\theta"},\;\theta"\sim\pi_t$) is always $\sigma$-sub-Gaussian, then Thompson Sampling satisfies $\mathcal{R}^2(Q^{\mathrm{TS}};\pi_0)<\infty$, and hence $\mathcal{R}_T(Q^{\math

Figures (7)

  • Figure 1: Thompson Sampling and the $\mathcal{R}^2$-optimal policy play a Gaussian bandit with reward variance $1$. Left: comparing their cumulative regret $\mathcal{R}_T(Q^{\text{TS}};\pi_0)$ vs. $\mathcal{R}_T(Q^*;\pi_0)$ where $\pi_0=N(0,1)\times N(0,0)$ (20K trials). Right: comparing the two regularizers $\tilde{\nu}(N(\mu,1)\times N(0,0))$ vs. $\nu(N(\mu,1)\times N(0,0))$ where $\mu$ approaches $0$ from below.
  • Figure 2: UCB plays a two-armed Bernoulli bandit. Left: confidence intervals around empirical means. Right: upper confidence bounds. The suboptimal arm (arm 2) is pulled whenever the corresponding upper confidence bound is higher.
  • Figure 3: Thompson Sampling plays a two-armed Bernoulli bandit. Left: credible intervals around posterior means. Right: the overlap of credible intervals. The overlap, when present, reflects the frequency of pulling the suboptimal arm (arm 2).
  • Figure 4: Thompson Sampling plays a two-armed Bernoulli bandit. Left: the overlap of credible intervals vs. the pulling rate of the suboptimal arm. Right: the regularizer of Thompson Sampling vs. the pulling rate of the suboptimal arm.
  • Figure 5: Thompson Sampling and the $\mathcal{R}^2$-optimal policy (with different values of $\bar{M}$) play a Bernoulli bandit. Left: comparing their cumulative regret $\mathcal{R}_T(Q^{\text{TS}};\pi_0)$ vs. $\mathcal{R}_T(Q^{\bar{M}};\pi_0)$ where $\pi_0=\mathrm{Beta}(1,1)\times\mathrm{Beta}(1,1)$ (200K trials). Right: comparing the two regularizers $\tilde{\nu}(\mathrm{Beta}(5,4)\times\mathrm{Beta}(k,k))$ vs. $\nu^{\bar{M}}(\mathrm{Beta}(5,4)\times\mathrm{Beta}(k,k))$ where $\bar{M}=40$ and $k=1,\dots,7$.
  • ...and 2 more figures
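
The selection rule described in the Figure 2 caption (pull the arm whose upper confidence bound is highest) can be sketched as follows. The captions do not say which UCB variant or confidence radius the paper uses, so this example assumes the textbook UCB1 index, empirical mean plus $\sqrt{2\ln t / n_a}$. The function name `ucb1_bernoulli`, its parameters, and the arm means are hypothetical.

```python
import math
import random


def ucb1_bernoulli(true_means, horizon, seed=0):
    """UCB1 on a Bernoulli bandit (assumed variant); returns pulls per arm."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    pulls = [0] * n_arms
    reward_sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            # Pull each arm once so every empirical mean is defined.
            arm = t - 1
        else:
            # Index = empirical mean + confidence radius sqrt(2 ln t / n_a).
            ucb = [
                reward_sums[a] / pulls[a] + math.sqrt(2.0 * math.log(t) / pulls[a])
                for a in range(n_arms)
            ]
            arm = max(range(n_arms), key=lambda a: ucb[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        pulls[arm] += 1
        reward_sums[arm] += reward
    return pulls


if __name__ == "__main__":
    print("pulls per arm:", ucb1_bernoulli([0.7, 0.4], horizon=5000))
```

Unlike posterior sampling, the rule is deterministic given the history: the suboptimal arm is pulled exactly when its bound exceeds the other arm's bound, which is the comparison plotted in the right panel of Figure 2.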

Theorems & Definitions (19)

  • Proposition 1: $\mathcal{R}^2$-finiteness of Thompson Sampling
  • Remark 1
  • Remark 2
  • Theorem 1: Online optimization
  • Proposition 2: Covariance factorization
  • Proposition 3: Incomplete learning
  • Remark 3
  • Proposition 4: Closed-form solution
  • Proposition 5: Phase change
  • Remark 4
  • ...and 9 more