Achieving adaptivity and optimality for multi-armed bandits using Exponential-Kullback Leibler Maillard Sampling

Hao Qin, Kwang-Sung Jun, Chicheng Zhang

TL;DR

This work addresses K-armed bandits whose rewards follow a one-parameter exponential distribution family (OPED). It asks whether a single algorithm can simultaneously be asymptotically optimal, minimax optimal up to a √ln(K) factor, Sub-UCB, and variance-adaptive in its worst-case regret. The authors introduce Exponential-Kullback-Leibler Maillard Sampling (Exp-KL-MS), a Maillard Sampling-inspired algorithm whose sampling probabilities depend on the KL divergence between each arm's estimate and the empirical best arm, modulated by an inverse-temperature function L. With the canonical choice L(k) = k - 1, Exp-KL-MS satisfies all four criteria and comes with finite-time regret guarantees. The framework also admits other choices of L(k), and it suggests extensions to broader sufficient statistics and to contextual and generalized linear bandits. The results are relevant to practical decision-making problems where rewards are naturally modeled by an exponential family and variance-aware guarantees are desirable.
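The sampling rule described above can be sketched for the Bernoulli case. This is a minimal illustration, not the authors' implementation: it assumes the probability of pulling arm a is proportional to exp(-L(N_a) · KL(μ̂_a, μ̂_max)), where N_a is the arm's pull count, and that each arm is pulled once before sampling starts. The function names (`bernoulli_kl`, `exp_kl_ms_probs`, `run`) and the clipping constant `eps` are hypothetical choices made for this example.

```python
import math
import random


def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), with q clipped away from 0 and 1."""
    p = min(max(p, 0.0), 1.0)
    q = min(max(q, eps), 1.0 - eps)
    kl = 0.0
    if p > 0.0:
        kl += p * math.log(p / q)
    if p < 1.0:
        kl += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return max(kl, 0.0)


def exp_kl_ms_probs(means, counts, L=lambda k: k - 1):
    """Assumed sampling rule: weight_a = exp(-L(N_a) * KL(mean_a, best empirical mean))."""
    best = max(means)
    weights = [math.exp(-L(n) * bernoulli_kl(m, best)) for m, n in zip(means, counts)]
    total = sum(weights)
    return [w / total for w in weights]


def run(true_means, horizon, seed=0):
    """Simulate the sketch on Bernoulli arms and return the pull count of each arm."""
    rng = random.Random(seed)
    num_arms = len(true_means)
    counts = [0] * num_arms
    sums = [0.0] * num_arms
    for t in range(horizon):
        if t < num_arms:
            arm = t  # assumed initialization: pull every arm once
        else:
            means = [s / n for s, n in zip(sums, counts)]
            probs = exp_kl_ms_probs(means, counts)
            arm = rng.choices(range(num_arms), weights=probs)[0]
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts
```

Under this sketch, the empirical best arm always has KL divergence zero and therefore the largest weight, while an arm pulled only once has L(1) = 0 and is not yet penalized.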

Abstract

We study the problem of $K$-armed bandits with reward distributions belonging to a one-parameter exponential distribution family. In the literature, several criteria have been proposed to evaluate the performance of such algorithms, including Asymptotic Optimality, Minimax Optimality, Sub-UCB, and variance-adaptive worst-case regret bound. Thompson Sampling-based and Upper Confidence Bound-based algorithms have been employed to achieve some of these criteria. However, none of these algorithms simultaneously satisfy all the aforementioned criteria. In this paper, we design an algorithm, Exponential Kullback-Leibler Maillard Sampling (abbrev. Exp-KL-MS), that can achieve multiple optimality criteria simultaneously, including Asymptotic Optimality, Minimax Optimality with a $\sqrt{\ln (K)}$ factor, Sub-UCB, and variance-adaptive worst-case regret bound.
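For reference, three of the criteria named above are commonly formalized as follows. The notation here is assumed for illustration and follows standard statements from the bandit literature rather than the paper's own text: $\mu_a$ is the mean of arm $a$, $\mu_{\max}$ the largest mean, $\Delta_a = \mu_{\max} - \mu_a$, $N_{T,a}$ the number of pulls of arm $a$ up to time $T$, and $C_1, C_2$ absolute constants. The variance-adaptive bound is omitted because its exact form depends on the paper's definitions.

```latex
% Expected regret over T rounds
\mathrm{Reg}(T) = \sum_{a=1}^{K} \Delta_a \, \mathbb{E}[N_{T,a}]

% Asymptotic optimality (matching the Lai-Robbins lower bound)
\limsup_{T \to \infty} \frac{\mathrm{Reg}(T)}{\ln T}
  \le \sum_{a : \Delta_a > 0} \frac{\Delta_a}{\mathrm{KL}(\mu_a, \mu_{\max})}

% Minimax optimality up to a \sqrt{\ln K} factor
\mathrm{Reg}(T) = O\!\left(\sqrt{K T \ln K}\right)

% Sub-UCB
\mathrm{Reg}(T) \le C_1 \sum_{a=1}^{K} \Delta_a
  + C_2 \sum_{a : \Delta_a > 0} \frac{\ln T}{\Delta_a}
```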

Paper Structure

This paper contains 55 sections, 31 theorems, 90 equations, 2 figures, 2 tables, 1 algorithm.

Key Result

Theorem 4

For any $K$-arm bandit problem under Assumptions assum:oped and assum:reward-dist, $\textsc{Exp-KL-MS}$ (alg:general-exp-kl-ms) with $L(k) = k - 1$ has regret bounded as follows. For any $\Delta > 0$ and $c \in (0, \frac{1}{4}]$:

Figures (2)

  • Figure 1: Case splitting of our regret analysis.
  • Figure 2: Roadmap of the proof of the theorem labeled thm:expected-regret-total.

Theorems & Definitions (33)

  • Theorem 4
  • Corollary 5: Logarithmic Minimax Ratio
  • Corollary 6: Asymptotic Optimality
  • Corollary 7: Sub-UCB
  • Corollary 9: Adaptive Variance Ratio
  • Corollary 9
  • Corollary 9
  • Corollary 9
  • Corollary 9
  • Theorem 9
  • ...and 23 more