Efficient and Adaptive Posterior Sampling Algorithms for Bandits

Bingshan Hu, Zhiming Huang, Tianyue H. Zhang, Mathias Lécuyer, Nidhi Hegde

TL;DR

The paper advances practical Thompson Sampling for bandits with bounded rewards by deriving a tighter finite-horizon regret bound for Gaussian-prior TS and proposing two scalable, parameterized variants, TS-MA-$\alpha$ and TS-TD-$\alpha$, that trade regret for reduced posterior sampling. TS-MA-$\alpha$ aggregates samples in batches to decouple sampling complexity from the number of arms, while TS-TD-$\alpha$ adaptively switches sampling effort over time to focus on promising arms. Theoretical results show per-arm regret of $O(\ln^{\alpha+1}(T)/\Delta)$ and scalable sampling costs, complemented by experiments demonstrating substantial sampling efficiency gains in large-arm settings. Collectively, the work offers tunable, computationally efficient alternatives to standard TS, enabling practical deployment in large-scale bandit applications with bounded rewards.

Abstract

We study Thompson Sampling-based algorithms for stochastic bandits with bounded rewards. As the existing problem-dependent regret bound for Thompson Sampling with Gaussian priors [Agrawal and Goyal, 2017] is vacuous when $T \le 288 e^{64}$, we derive a more practical bound that tightens the coefficient of the leading term from $288 e^{64}$ to $1270$. Additionally, motivated by large-scale real-world applications that require scalability, adaptive computational resource allocation, and a balance in utility and computation, we propose two parameterized Thompson Sampling-based algorithms: Thompson Sampling with Model Aggregation (TS-MA-$\alpha$) and Thompson Sampling with Timestamp Duelling (TS-TD-$\alpha$), where $\alpha \in [0,1]$ controls the trade-off between utility and computation. Both algorithms achieve an $O\left(K\ln^{\alpha+1}(T)/\Delta\right)$ regret bound, where $K$ is the number of arms, $T$ is the finite learning horizon, and $\Delta$ denotes the single-round performance loss when pulling a sub-optimal arm.
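To make the baseline concrete, here is a minimal Python sketch of standard Thompson Sampling with Gaussian priors on Bernoulli arms. It is an illustration under stated assumptions, not the paper's algorithm listing: the posterior form $\mathcal{N}(\hat{\mu}_i, 1/(n_i+1))$ follows the usual Agrawal and Goyal presentation, and the function name, the Bernoulli reward model, and the sample counter are hypothetical choices made for this example. The counter shows that this baseline draws one posterior sample per arm per round, i.e. $K \cdot T$ samples in total, which is the cost the TS-MA-$\alpha$ and TS-TD-$\alpha$ variants aim to reduce.

```python
import numpy as np


def gaussian_thompson_sampling(means, horizon, seed=0):
    """Sketch of Gaussian-prior Thompson Sampling on Bernoulli arms.

    Assumption: the posterior of arm i after n_i pulls is taken to be
    N(reward_sum_i / (n_i + 1), 1 / (n_i + 1)).
    Returns (pseudo_regret, pulls_per_arm, num_posterior_samples).
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    num_arms = means.size
    pulls = np.zeros(num_arms, dtype=int)
    reward_sums = np.zeros(num_arms)
    best_mean = means.max()
    pseudo_regret = 0.0
    num_posterior_samples = 0

    for _ in range(horizon):
        # One posterior sample per arm in every round.
        post_mean = reward_sums / (pulls + 1)
        post_std = 1.0 / np.sqrt(pulls + 1)
        theta = rng.normal(post_mean, post_std)
        num_posterior_samples += num_arms

        # Pull the arm whose sampled mean is largest.
        arm = int(np.argmax(theta))
        reward = float(rng.random() < means[arm])  # bounded reward in {0, 1}

        pulls[arm] += 1
        reward_sums[arm] += reward
        pseudo_regret += best_mean - means[arm]

    return pseudo_regret, pulls, num_posterior_samples


if __name__ == "__main__":
    regret, pulls, samples = gaussian_thompson_sampling([0.9, 0.1, 0.1], 2000)
    print("pseudo-regret:", regret)
    print("pulls per arm:", pulls)
    print("posterior samples drawn:", samples)
```

The pseudo-regret here is accumulated from the true means, which are only available in simulation; it is used purely to illustrate the quantity that the regret bounds control.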

Paper Structure (29 sections, 12 theorems, 70 equations, 3 figures, 3 algorithms)


Key Result

Theorem 1

The problem-dependent regret of the Thompson Sampling algorithm (Alg: TS) is at most $\sum_{i: \Delta_i >0} 1270 \ln \left(T \Delta_i^2 + 100^{\frac{1}{3}}\right)/\Delta_i + 182.5/\Delta_i+ \Delta_i$. The worst-case regret is $O \left( \sqrt{KT \ln(K)} \right)$, where $O(\cdot)$ only hides a universal constant.
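A short numerical check helps show when a bound of this shape is informative. The Python sketch below evaluates the Theorem 1 expression and compares it with the trivial bound $T \cdot \max_i \Delta_i$ that always holds. Two assumptions are made here: the sum is read as covering all three terms for each sub-optimal arm, and the helper names (`theorem1_bound`, `trivial_bound`) are hypothetical. The constant $288 e^{64}$ is the earlier leading coefficient quoted in the abstract, included only for scale.

```python
import math

# Earlier leading coefficient quoted in the abstract, for comparison.
OLD_COEFFICIENT = 288 * math.exp(64)
NEW_COEFFICIENT = 1270.0


def theorem1_bound(gaps, horizon):
    """Evaluate the Theorem 1 expression.

    Assumption: the sum runs over all three terms for each arm with gap > 0.
    """
    total = 0.0
    for gap in gaps:
        if gap <= 0:
            continue  # the optimal arm contributes no regret
        log_term = math.log(horizon * gap**2 + 100 ** (1 / 3))
        total += NEW_COEFFICIENT * log_term / gap + 182.5 / gap + gap
    return total


def trivial_bound(gaps, horizon):
    """Regret can never exceed the largest gap paid in every round."""
    return horizon * max(gaps)


if __name__ == "__main__":
    gaps = [0.0, 0.5]  # one optimal arm, one sub-optimal arm (illustrative)
    for horizon in (10**4, 10**6, 10**8):
        bound = theorem1_bound(gaps, horizon)
        trivial = trivial_bound(gaps, horizon)
        status = "informative" if bound < trivial else "vacuous"
        print(f"T={horizon}: bound={bound:.1f}, trivial={trivial:.1f}, {status}")
    print(f"old coefficient 288*e^64 = {OLD_COEFFICIENT:.3e}")
```

For this illustrative two-arm instance the expression exceeds the trivial bound at $T = 10^4$ but falls below it at $T = 10^6$, whereas a leading coefficient on the order of $288 e^{64}$ would require an astronomically large horizon before the bound said anything.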

Figures (3)

  • Figure 1: Comparison between different algorithms for $20$ arms with one optimal arm.
  • Figure 2: Total number of drawn posterior samples in TS-TD-$\alpha$.
  • Figure 3: Percentage of data-dependent samples drawn for optimal arms in TS-MA-$\alpha$.

Theorems & Definitions (23)

  • Theorem 1
  • Lemma 2
  • Theorem 3
  • Theorem 4
  • proof
  • proof
  • Lemma 8
  • proof: Proof of Lemma [bandit: lemma 1]
  • Lemma 9
  • proof: Proof of Lemma [Eric 3]
  • ...and 13 more