Achieving adaptivity and optimality for multi-armed bandits using Exponential-Kullback Leibler Maillard Sampling
Hao Qin, Kwang-Sung Jun, Chicheng Zhang
TL;DR
This work addresses K-armed bandits whose rewards follow a one-parameter exponential distribution (OPED) family and seeks to satisfy four criteria simultaneously: asymptotic optimality, minimax optimality up to a √ln(K) factor, Sub-UCB, and a variance-adaptive worst-case regret bound. The authors introduce Exponential Kullback-Leibler Maillard Sampling (Exp-KL-MS), a Maillard Sampling-inspired algorithm whose sampling probabilities depend on the KL divergence between each arm's estimate and the empirical best, modulated by an inverse-temperature function L. With the canonical choice L(k) = k-1, Exp-KL-MS achieves all four criteria, along with finite-time regret guarantees. The framework generalizes to other choices of L(k) and opens avenues for extensions to broader sufficient statistics and contextual/generalized linear bandits. These results may be relevant to practical decision-making problems where reward models are naturally exponential-family and variance-aware guarantees are desirable.
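The sampling rule described above can be illustrated with a minimal sketch. It rests on several assumptions not stated in this summary: rewards are Bernoulli (one member of the exponential family), each arm is pulled once before sampling begins, and the probability of arm a is taken proportional to exp(-L(N_a) · KL(μ̂_a, μ̂_max)), where N_a is the arm's pull count. The function names `bernoulli_kl`, `sampling_probs`, and `run` are hypothetical and chosen only for illustration.

```python
import math
import random


def bernoulli_kl(p, q, eps=1e-12):
    """KL(Ber(p) || Ber(q)), with both arguments clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def sampling_probs(means, counts, L=lambda k: k - 1):
    """Assumed form: weight_a = exp(-L(N_a) * KL(mean_a, best empirical mean))."""
    best = max(means)
    weights = [
        math.exp(-L(n) * bernoulli_kl(m, best)) for m, n in zip(means, counts)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def run(true_means, horizon, seed=0):
    """Simulate the sketch on Bernoulli arms and return per-arm pull counts."""
    rng = random.Random(seed)
    num_arms = len(true_means)
    counts = [0] * num_arms
    sums = [0.0] * num_arms
    for t in range(horizon):
        if t < num_arms:
            # Assumption: initialise by pulling every arm once.
            arm = t
        else:
            means = [s / n for s, n in zip(sums, counts)]
            probs = sampling_probs(means, counts)
            arm = rng.choices(range(num_arms), weights=probs)[0]
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


if __name__ == "__main__":
    print(run([0.2, 0.5, 0.8], horizon=2000))
```

With L(k) = k-1, an arm pulled only once has L = 0 and therefore the same weight as the empirical best arm, so poorly explored arms keep a non-negligible sampling probability. Arms with many pulls and a large KL gap to the empirical best are sampled exponentially less often.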
Abstract
We study the problem of $K$-armed bandits with reward distributions belonging to a one-parameter exponential distribution family. In the literature, several criteria have been proposed to evaluate the performance of bandit algorithms, including Asymptotic Optimality, Minimax Optimality, Sub-UCB, and a variance-adaptive worst-case regret bound. Thompson Sampling-based and Upper Confidence Bound-based algorithms have been employed to achieve some of these criteria. However, none of these algorithms satisfies all of the aforementioned criteria simultaneously. In this paper, we design an algorithm, Exponential Kullback-Leibler Maillard Sampling (abbrev. Exp-KL-MS), that achieves multiple optimality criteria simultaneously, including Asymptotic Optimality, Minimax Optimality up to a $\sqrt{\ln (K)}$ factor, Sub-UCB, and a variance-adaptive worst-case regret bound.
