Efficient and Adaptive Posterior Sampling Algorithms for Bandits
Bingshan Hu, Zhiming Huang, Tianyue H. Zhang, Mathias Lécuyer, Nidhi Hegde
TL;DR
The paper advances practical Thompson Sampling for bandits with bounded rewards by deriving a tighter finite-horizon regret bound for Gaussian-prior TS and proposing two scalable, parameterized variants, TS-MA-$α$ and TS-TD-$α$, that trade regret for reduced posterior sampling. TS-MA-$α$ aggregates samples in batches to decouple sampling complexity from the number of arms, while TS-TD-$α$ adaptively switches sampling effort over time to focus on promising arms. Theoretical results show per-arm regret of $O(\ln^{α+1}(T)/Δ)$ and scalable sampling costs, complemented by experiments demonstrating substantial sampling efficiency gains in large-arm settings. Collectively, the work offers tunable, computationally efficient alternatives to standard TS, enabling practical deployment in large-scale bandit applications with bounded rewards.
Abstract
We study Thompson Sampling-based algorithms for stochastic bandits with bounded rewards. As the existing problem-dependent regret bound for Thompson Sampling with Gaussian priors [Agrawal and Goyal, 2017] is vacuous when $T \le 288 e^{64}$, we derive a more practical bound that tightens the coefficient of the leading term from $288 e^{64}$ to $1270$. Additionally, motivated by large-scale real-world applications that require scalability, adaptive computational resource allocation, and a balance between utility and computation, we propose two parameterized Thompson Sampling-based algorithms: Thompson Sampling with Model Aggregation (TS-MA-$α$) and Thompson Sampling with Timestamp Duelling (TS-TD-$α$), where $α\in [0,1]$ controls the trade-off between utility and computation. Both algorithms achieve an $O \left(K\ln^{α+1}(T)/Δ\right)$ regret bound, where $K$ is the number of arms, $T$ is the finite learning horizon, and $Δ$ denotes the single-round performance loss when pulling a sub-optimal arm.
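For orientation, the baseline that the abstract builds on can be sketched as follows. This is a minimal illustration of Thompson Sampling with Gaussian priors on Bernoulli arms, not the paper's own code. It assumes the commonly stated form in which arm $i$ is sampled from $\mathcal{N}(\hat{\mu}_i, 1/(n_i+1))$, with $\hat{\mu}_i$ the cumulative reward divided by $n_i+1$; the exact posterior used in the paper may differ. The function name `gaussian_thompson_sampling` and its parameters are hypothetical, and TS-MA-$α$ and TS-TD-$α$ are not implemented here.

```python
import random


def gaussian_thompson_sampling(arm_means, horizon, seed=0):
    """Sketch of Gaussian-prior Thompson Sampling on Bernoulli arms.

    Assumption (not taken from the paper): the sample for arm i is drawn
    from N(mu_hat_i, 1 / (n_i + 1)), with mu_hat_i = reward_sum_i / (n_i + 1).
    Returns the per-arm pull counts and the cumulative pseudo-regret.
    """
    rng = random.Random(seed)
    num_arms = len(arm_means)
    pulls = [0] * num_arms          # n_i: number of pulls of arm i
    reward_sums = [0.0] * num_arms  # cumulative observed reward of arm i
    best_mean = max(arm_means)
    regret = 0.0

    for _ in range(horizon):
        # Draw one posterior sample per arm: K draws in every round.
        samples = []
        for i in range(num_arms):
            mu_hat = reward_sums[i] / (pulls[i] + 1)
            std = (1.0 / (pulls[i] + 1)) ** 0.5
            samples.append(rng.gauss(mu_hat, std))

        # Pull the arm with the largest sample.
        arm = max(range(num_arms), key=samples.__getitem__)

        # Bernoulli reward, so rewards are bounded in [0, 1].
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        pulls[arm] += 1
        reward_sums[arm] += reward

        # Pseudo-regret: gap Delta of the pulled arm.
        regret += best_mean - arm_means[arm]

    return pulls, regret


pulls, regret = gaussian_thompson_sampling([0.2, 0.5, 0.9], horizon=2000)
print(pulls, regret)
```

Note that this baseline draws one sample for every arm in every round, so the total number of posterior samples is $K \cdot T$. That sampling cost is what the two proposed variants aim to reduce, at the price of the extra $\ln^{α}(T)$ factor in the regret bound.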
