Efficient and Adaptive Posterior Sampling Algorithms for Bandits

Bingshan Hu; Zhiming Huang; Tianyue H. Zhang; Mathias Lécuyer; Nidhi Hegde

TL;DR

The paper advances practical Thompson Sampling for bandits with bounded rewards by deriving a tighter finite-horizon regret bound for Gaussian-prior TS and proposing two scalable, parameterized variants, TS-MA-$α$ and TS-TD-$α$, that trade regret for reduced posterior sampling. TS-MA-$α$ aggregates samples in batches to decouple sampling complexity from the number of arms, while TS-TD-$α$ adaptively switches sampling effort over time to focus on promising arms. Theoretical results show per-arm regret of $O(\ln^{α+1}(T)/Δ)$ and scalable sampling costs, complemented by experiments demonstrating substantial sampling efficiency gains in large-arm settings. Collectively, the work offers tunable, computationally efficient alternatives to standard TS, enabling practical deployment in large-scale bandit applications with bounded rewards.
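As a rough reading of the stated per-arm rate (a worked instance of the bound, not an additional result from the paper), and assuming $α \in [0,1]$ as in the abstract, the two endpoints of the range give:

$$α = 0:\ O\left(\ln(T)/Δ\right), \qquad α = 1:\ O\left(\ln^{2}(T)/Δ\right).$$

One endpoint keeps the logarithmic per-arm rate, and the other pays one extra $\ln(T)$ factor in the regret bound.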

Abstract

We study Thompson Sampling-based algorithms for stochastic bandits with bounded rewards. As the existing problem-dependent regret bound for Thompson Sampling with Gaussian priors [Agrawal and Goyal, 2017] is vacuous when $T \le 288 e^{64}$, we derive a more practical bound that tightens the coefficient of the leading term from $288 e^{64}$ to $1270$. Additionally, motivated by large-scale real-world applications that require scalability, adaptive computational resource allocation, and a balance between utility and computation, we propose two parameterized Thompson Sampling-based algorithms: Thompson Sampling with Model Aggregation (TS-MA-$α$) and Thompson Sampling with Timestamp Duelling (TS-TD-$α$), where $α\in [0,1]$ controls the trade-off between utility and computation. Both algorithms achieve an $O \left(K\ln^{α+1}(T)/Δ\right)$ regret bound, where $K$ is the number of arms, $T$ is the finite learning horizon, and $Δ$ denotes the single-round performance loss when pulling a sub-optimal arm.
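For readers who want a concrete picture of the baseline, the following is a minimal sketch of Thompson Sampling with Gaussian posterior samples on bounded rewards. It is an illustration under stated assumptions, not the paper's Algorithm: the posterior variance $1/n_i$, the initial round-robin pull of each arm, the Bernoulli reward model, and the function name `gaussian_thompson_sampling` are all choices made here for the example.

```python
import math
import random


def gaussian_thompson_sampling(means, horizon, seed=0):
    """Sketch of Gaussian-sample Thompson Sampling on Bernoulli arms.

    Assumptions (not taken from the paper): each arm is pulled once first,
    and arm i is then sampled from N(empirical_mean_i, 1 / pulls_i).
    Returns (pull_counts, cumulative_pseudo_regret).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k   # number of pulls per arm
    sums = [0.0] * k   # sum of observed rewards per arm
    best = max(means)
    regret = 0.0

    for t in range(horizon):
        if t < k:
            arm = t  # initialisation: pull every arm once
        else:
            # One posterior sample per arm, so K samples per round.
            samples = [
                rng.gauss(sums[i] / counts[i], 1.0 / math.sqrt(counts[i]))
                for i in range(k)
            ]
            arm = max(range(k), key=samples.__getitem__)
        # Bernoulli reward, bounded in [0, 1].
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return counts, regret


counts, regret = gaussian_thompson_sampling([0.9, 0.1], horizon=2000)
```

Note that this baseline draws one sample per arm in every round, which is the per-round sampling cost that the TS-MA-$α$ and TS-TD-$α$ variants aim to reduce.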

Paper Structure (29 sections, 12 theorems, 70 equations, 3 figures, 3 algorithms)

Introduction
Learning Problem
Related Work
Algorithms
Vanilla Thompson Sampling
TS-MA-$\alpha$
TS-TD-$\alpha$
Experimental Results
Discussion and future work
Motivations
Useful Facts
Proofs for Theorem \ref{['Theorem: TS new 1']}
Proofs for Lemma \ref{['UBC 22']}
Re-statement of Lemma \ref{['UBC 22']}.
Proofs for (\ref{['UBC 333']}).
...and 14 more sections

Key Result

Theorem 1

The problem-dependent regret of Thompson Sampling with Gaussian priors (Algorithm "Alg: TS") is at most $\sum_{i: \Delta_i >0} 1270 \ln \left(T \Delta_i^2 + 100^{\frac{1}{3}}\right)/\Delta_i + 182.5/\Delta_i+ \Delta_i$. The worst-case regret is $O \left( \sqrt{KT \ln(K)} \right)$, where $O(\cdot)$ hides only a universal constant.
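To get a feel for the constants, the bound can be evaluated numerically. The sketch below is illustrative only: the helper name `ts_regret_bound` is hypothetical, and it assumes the sum over sub-optimal arms covers all three terms of the expression in Theorem 1.

```python
import math


def ts_regret_bound(horizon, gaps):
    """Evaluate the problem-dependent bound of Theorem 1.

    Assumption: the sum over arms with gap > 0 applies to all three terms.
    """
    total = 0.0
    for gap in gaps:
        if gap <= 0:
            continue  # optimal arms contribute nothing
        total += 1270 * math.log(horizon * gap**2 + 100 ** (1 / 3)) / gap
        total += 182.5 / gap + gap
    return total


# Leading coefficient of the earlier bound cited in the abstract.
old_coefficient = 288 * math.exp(64)

# Example: one optimal arm and one arm with gap 0.5 over T = 10**6 rounds.
new_bound = ts_regret_bound(10**6, [0.0, 0.5])
trivial_bound = 0.5 * 10**6  # regret from pulling the worse arm every round
```

In this example `new_bound` falls below `trivial_bound`, so the bound is informative at $T = 10^6$, whereas `old_coefficient` alone exceeds $10^{30}$.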

Figures (3)

Figure 1: Comparison between different algorithms for $20$ arms with one optimal arm.
Figure 2: Total number of drawn posterior samples in TS-TD-$\alpha$.
Figure 3: Percentage of data-dependent samples drawn for optimal arms in TS-MA-$\alpha$.

Theorems & Definitions (23)

Theorem 1
Lemma 2
Theorem 3
Theorem 4
proof
proof
Lemma 8
proof : Proof of Lemma \ref{['bandit: lemma 1']}
Lemma 9
proof : Proof of Lemma \ref{['Eric 3']}
...and 13 more
