Kullback-Leibler Maillard Sampling for Multi-armed Bandits with Bounded Rewards
Hao Qin, Kwang-Sung Jun, Chicheng Zhang
TL;DR
KL-MS is a Kullback-Leibler-based variant of Maillard sampling for bandits with rewards in $[0,1]$. Its action probabilities are available in closed form, which is useful for offline policy evaluation. The algorithm is asymptotically optimal for Bernoulli rewards and comes with a finite-time regret analysis that yields a worst-case bound of $O\big(\sqrt{\mu^*(1-\mu^*) K T \ln K} + K\ln T\big)$, where $\mu^*$ is the expected reward of the optimal arm and $T$ is the time horizon. This bound adapts to $\mu^*(1-\mu^*)$, which is the variance of the optimal arm in the Bernoulli case, and the algorithm also satisfies the sub-UCB criterion. KL-MS is positioned as an off-policy-friendly alternative to Thompson sampling for bounded rewards; possible extensions include exponential-family rewards and structured bandits.
Abstract
We study $K$-armed bandit problems where the reward distributions of the arms are all supported on the $[0,1]$ interval. It has been a challenge to design regret-efficient randomized exploration algorithms in this setting. Maillard sampling \cite{maillard13apprentissage}, an attractive alternative to Thompson sampling, has recently been shown to achieve competitive regret guarantees in the sub-Gaussian reward setting \cite{bian2022maillard} while maintaining closed-form action probabilities, which is useful for offline policy evaluation. In this work, we propose the Kullback-Leibler Maillard Sampling (KL-MS) algorithm, a natural extension of Maillard sampling that achieves a KL-style gap-dependent regret bound. We show that KL-MS is asymptotically optimal when the rewards are Bernoulli and has a worst-case regret bound of the form $O(\sqrt{\mu^*(1-\mu^*) K T \ln K} + K \ln T)$, where $\mu^*$ is the expected reward of the optimal arm and $T$ is the time horizon.
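To make the idea of closed-form action probabilities concrete, the following Python sketch shows one way a KL-based Maillard sampling rule could look. The abstract does not state the rule, so the form used here is an assumption: each arm $a$ is played with probability proportional to $\exp(-N_a \, \mathrm{kl}(\hat{\mu}_a, \hat{\mu}_{\max}))$, where $N_a$ is the arm's pull count, $\hat{\mu}_a$ its empirical mean, $\hat{\mu}_{\max}$ the largest empirical mean, and $\mathrm{kl}$ the binary relative entropy. The function names, the initial round-robin phase, the clipping constant `eps`, and the Bernoulli simulator are all hypothetical choices made for illustration.

```python
import math
import random


def binary_kl(p, q, eps=1e-12):
    """Binary relative entropy kl(p, q), with q clipped away from 0 and 1.

    The clipping constant eps is an illustrative choice, not from the paper.
    """
    q = min(max(q, eps), 1.0 - eps)
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total


def kl_ms_probabilities(counts, means):
    """Assumed rule: p_a proportional to exp(-N_a * kl(mean_a, max mean))."""
    best = max(means)
    weights = [math.exp(-n * binary_kl(m, best)) for n, m in zip(counts, means)]
    normalizer = sum(weights)  # at least about 1, since the best arm has weight close to 1
    return [w / normalizer for w in weights]


def run_kl_ms(arm_means, horizon, seed=0):
    """Simulate the sketch on Bernoulli arms; returns the pull count of each arm."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(horizon):
        if t < k:
            arm = t  # assumed initialization: pull every arm once
        else:
            means = [s / n for s, n in zip(sums, counts)]
            probs = kl_ms_probabilities(counts, means)
            arm = rng.choices(range(k), weights=probs)[0]
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


if __name__ == "__main__":
    print(kl_ms_probabilities([50, 50, 50], [0.9, 0.7, 0.5]))
    print(run_kl_ms([0.9, 0.7, 0.5], horizon=2000))
```

Because `kl_ms_probabilities` returns the exact probability with which each arm is chosen, those values can be logged alongside the actions and later used as propensities, for example in inverse-propensity-weighted estimators for offline evaluation. Thompson sampling, by contrast, generally requires numerical integration or Monte Carlo estimation to recover the same quantities.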
