Rethinking Langevin Thompson Sampling from A Stochastic Approximation Perspective

Weixin Wang, Haoyang Zheng, Guang Lin, Wei Deng, Pan Xu

TL;DR

This work introduces TS-SA, a Thompson Sampling variant that replaces a non-stationary, round-specific posterior with a fixed stationary target posterior. By integrating stochastic approximation through time-averaging of Langevin proposals and using gradient estimates from the most recent rewards, TS-SA achieves a fixed step-size and a unified convergence framework. Theoretical results establish posterior concentration and near-optimal regret bounds $\widetilde{\mathcal{O}}(\sqrt{KT})$, while experiments show strong empirical performance and robustness compared to TS, UCB, and TS-SGLD. The stationary-target perspective simplifies analysis, reduces memory, and offers practical guidelines for implementation in non-conjugate reward settings. TS-SA thus provides a principled, efficient alternative to dynamic-posterior TS methods with broad applicability to bandits and related sequential decision problems.

Abstract

Most existing approximate Thompson Sampling (TS) algorithms for multi-armed bandits use Stochastic Gradient Langevin Dynamics (SGLD) or its variants in each round to sample from the posterior, relaxing the need for conjugacy assumptions between priors and reward distributions in vanilla TS. However, they often require approximating a different posterior distribution in each round of the bandit problem. This requires tricky, round-specific tuning of hyperparameters such as dynamic learning rates, causing challenges in both theoretical analysis and practical implementation. To alleviate this non-stationarity, we introduce TS-SA, which incorporates stochastic approximation (SA) within the TS framework. In each round, TS-SA constructs a posterior approximation using only the most recent reward(s), performs a Langevin Monte Carlo (LMC) update, and applies an SA step to average noisy proposals over time. This can be interpreted as approximating a stationary posterior target throughout the entire algorithm, which further yields a fixed step-size, a unified convergence analysis framework, and improved posterior estimates through temporal averaging. We establish near-optimal regret bounds for TS-SA, with a simplified and more intuitive theoretical analysis enabled by interpreting the entire algorithm as a simulation of a stationary SGLD process. Our empirical results demonstrate that even a single-step Langevin update with a suitable warm-up substantially outperforms existing methods on bandit tasks.
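
The per-round loop described in the abstract (gradient from the most recent reward, one LMC update with a fixed step size, then an SA averaging step) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. It assumes unit-variance Gaussian rewards, a flat prior, one scalar iterate per arm, greedy arm selection on the current iterates, and an SA weight of the form `c1 / n**alpha`. The function names, the choice `h = 0.5 / horizon`, and all default values are hypothetical.

```python
import math
import random


def lmc_proposal(theta, grad_log_post, h, rng):
    """One Langevin Monte Carlo step with a fixed step size h."""
    return theta + h * grad_log_post + math.sqrt(2.0 * h) * rng.gauss(0.0, 1.0)


def sa_step(theta, proposal, omega):
    """SA averaging: move the iterate a fraction omega toward the proposal."""
    return (1.0 - omega) * theta + omega * proposal


def run_ts_sa_sketch(means, horizon, warmup=20, c1=1.0, alpha=0.6, seed=0):
    """Illustrative TS-SA-style loop; returns (pseudo-regret, iterates, pulls)."""
    rng = random.Random(seed)
    k = len(means)
    h = 0.5 / horizon      # fixed step size (hypothetical choice, below 1/T)
    theta = [0.0] * k      # one scalar iterate per arm
    pulls = [0] * k
    best = max(means)
    regret = 0.0
    for t in range(horizon):
        if t < warmup * k:
            arm = t % k    # warm-up: round-robin pulls
        else:
            # Assumption: the current iterates act as approximate posterior samples.
            arm = max(range(k), key=lambda a: theta[a])
        reward = rng.gauss(means[arm], 1.0)
        pulls[arm] += 1
        regret += best - means[arm]
        # Gradient of horizon * log N(reward | theta, 1): the most recent reward
        # stands in for all T terms of the stationary target posterior.
        grad = horizon * (reward - theta[arm])
        proposal = lmc_proposal(theta[arm], grad, h, rng)
        omega = min(1.0, c1 / pulls[arm] ** alpha)  # decaying SA weight (assumed form)
        theta[arm] = sa_step(theta[arm], proposal, omega)
    return regret, theta, pulls


if __name__ == "__main__":
    regret, theta, pulls = run_ts_sa_sketch([0.0, 0.5, 1.0], horizon=2000)
    print("pseudo-regret:", regret, "pulls per arm:", pulls)
```

Note that the step size `h` stays constant across rounds because the target does not change from round to round. Only the SA weight `omega` decays, and that decay is what averages the noisy proposals over time.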

Paper Structure

This paper contains 37 sections, 16 theorems, 110 equations, 4 figures, 2 tables, and 2 algorithms.

Key Result

Lemma 5.4

Suppose Assumptions assum:strongly_concave_theta and assum:joint_lipschitz hold. For any arm $a$, let $\mu^{\text{SA}}_a \propto \exp (\sum_{i=1}^T \log p_a({X}_i|\bm{\theta}) )$ be the target posterior distribution. Then for any $p \geq 1$, where $\kappa_a = \max\{ \frac{L_a}{m_a}, \frac{L_a}{\nu_a} \}$. This further implies that $\|{\bm{\theta}}-\bm{\theta}_a^*\|_{{\bm{\theta}} \sim \mu_a^{\text{SA}}}$ has sub-Gaus

Figures (4)

  • Figure 1: Regrets under different reward gaps $\Delta$, different number of arms $K$, and reward settings.
  • Figure 2: Slice plots of TS-SA hyperparameter tuning using Bayesian hyperparameter optimization, under fixed settings $K=10$, $\Delta=0.5$, $\tau=1.0$, and $N=1$. The $x$-axis denotes the tuning range of a single hyperparameter, while the $y$-axis shows the corresponding cumulative regret averaged over 50 independent runs. The color gradient from blue (early trials) to red (later trials) reflects the optimization trajectory. Regret decreases as Bayesian hyperparameter optimization identifies favorable values for warm-up pulls, batch size, and SA parameters $(c_1, c_2, \alpha)$, while the step size $h$ and offset $c_3$ exhibit relatively weak influence.
  • Figure 3: Ablation study of key hyperparameters in TS-SA ($K=10$, $\Delta=0.5$). The x-axis denotes the number of rounds, while the y-axis represents the cumulative regrets over 100 independent trials. Each subfigure plots the cumulative regret as a function of a single parameter while keeping others fixed. (a–b): Warm-up pulls ($\Omega$) and batch size ($\mathcal{B}$) show clear threshold behavior around 20, below which performance degrades significantly. (c): Increasing inner iterations ($N$) improves performance with diminishing returns. (d–h): Sampling-related parameters ($h$, $c_1$, $\alpha$) exhibit notable sensitivity and must be tuned carefully, while $c_2$ and $c_3$ show minor impact, which indicates robustness with respect to these choices.
  • Figure : Thompson Sampling in MAB

Theorems & Definitions (21)

  • Remark 4.1
  • Lemma 5.4: Concentration of target posterior
  • Lemma 5.5: Convergence of TS-SA
  • Remark 5.6
  • Theorem 5.7: Concentration of TS-SA approximate posterior
  • Remark 5.8
  • Theorem 5.9: Regret bound
  • Remark 5.10
  • Remark 6.1
  • Lemma B.1: Concentration of target posterior
  • ...and 11 more