Kullback-Leibler Maillard Sampling for Multi-armed Bandits with Bounded Rewards

Hao Qin, Kwang-Sung Jun, Chicheng Zhang

TL;DR

KL-MS is a KL-based Maillard sampling rule for bandits with $[0,1]$-bounded rewards. Its action probabilities are available in closed form, which is useful for offline policy evaluation. It is asymptotically optimal in the Bernoulli setting, and its worst-case regret bound, $O\big(\sqrt{\mu^*(1-\mu^*) K T \ln K} + K\ln T\big)$, adapts to the variance of the optimal arm, whose expected reward is $\mu^*$. The main contributions are a finite-time regret bound, this adaptive worst-case bound, a proof that KL-MS satisfies the sub-UCB criterion, and an asymptotic regret rate specific to Bernoulli rewards. The work positions KL-MS as an off-policy-friendly alternative to Thompson sampling for bounded rewards, with possible extensions to exponential-family rewards and structured bandits.
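
To make the offline-evaluation remark concrete, here is a minimal sketch of standard inverse propensity weighting (IPW), which relies on knowing the probability with which each logged action was chosen. This is a generic textbook estimator, not a procedure taken from the paper; the function name `ipw_value` and the `(arm, reward, logging_prob)` log format are assumptions made for illustration.

```python
def ipw_value(log, target_probs):
    """Inverse-propensity-weighted estimate of a target policy's mean reward.

    log          : list of (arm, reward, logging_prob) tuples (hypothetical format),
                   where logging_prob is the probability with which the logging
                   policy chose `arm` at that step.
    target_probs : list giving the fixed target policy's probability for each arm.
    """
    if not log:
        raise ValueError("log must contain at least one record")
    total = 0.0
    for arm, reward, logging_prob in log:
        # Reweight each observed reward by how much more (or less) likely
        # the target policy is to pick this arm than the logging policy was.
        total += target_probs[arm] / logging_prob * reward
    return total / len(log)


# Tiny hand-made log: two steps, each arm chosen with probability 0.5.
example_log = [(0, 1.0, 0.5), (1, 0.0, 0.5)]
always_arm_0 = [1.0, 0.0]
estimate = ipw_value(example_log, always_arm_0)
```

A sampling rule with closed-form action probabilities can record `logging_prob` exactly at every step, whereas a rule without closed-form probabilities would have to approximate them.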

Abstract

We study $K$-armed bandit problems where the reward distributions of the arms are all supported on the $[0,1]$ interval. It has been a challenge to design regret-efficient randomized exploration algorithms in this setting. Maillard sampling \cite{maillard13apprentissage}, an attractive alternative to Thompson sampling, has recently been shown to achieve competitive regret guarantees in the sub-Gaussian reward setting \cite{bian2022maillard} while maintaining closed-form action probabilities, which is useful for offline policy evaluation. In this work, we propose the Kullback-Leibler Maillard Sampling (KL-MS) algorithm, a natural extension of Maillard sampling that achieves a KL-style gap-dependent regret bound. We show that KL-MS is asymptotically optimal when the rewards are Bernoulli and has a worst-case regret bound of the form $O(\sqrt{\mu^*(1-\mu^*) K T \ln K} + K \ln T)$, where $\mu^*$ is the expected reward of the optimal arm and $T$ is the time horizon.
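
The abstract does not spell out the sampling rule, so the following is a minimal sketch under a stated assumption: that KL-MS pulls arm $a$ with probability proportional to $\exp(-N_a \, \mathrm{kl}(\hat{\mu}_a, \hat{\mu}_{\max}))$, where $N_a$ is the arm's pull count, $\hat{\mu}_a$ its empirical mean, $\hat{\mu}_{\max}$ the largest empirical mean, and $\mathrm{kl}$ the binary relative entropy. The function names, the clipping constant `eps`, and the Bernoulli simulator are illustrative choices, not taken from the paper.

```python
import math
import random


def binary_kl(p, q, eps=1e-12):
    """Binary relative entropy kl(p, q), with clipping to avoid log(0)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_ms_probs(counts, means):
    """Assumed KL-MS rule: p_a proportional to exp(-N_a * kl(mean_a, max mean))."""
    mu_max = max(means)
    weights = [math.exp(-n * binary_kl(m, mu_max)) for n, m in zip(counts, means)]
    total = sum(weights)  # at least 1, since the empirically best arm has weight 1
    return [w / total for w in weights]


def run_kl_ms(mu, horizon, seed=0):
    """Simulate the assumed rule on Bernoulli arms with true means `mu`.

    Returns the pull counts and a log of (arm, reward, action_probability).
    """
    rng = random.Random(seed)
    k = len(mu)
    counts = [0] * k
    sums = [0.0] * k
    log = []
    for t in range(horizon):
        if t < k:
            # Initialization: pull each arm once, deterministically.
            arm, prob = t, 1.0
        else:
            means = [s / n for s, n in zip(sums, counts)]
            probs = kl_ms_probs(counts, means)
            arm = rng.choices(range(k), weights=probs)[0]
            prob = probs[arm]
        reward = 1.0 if rng.random() < mu[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        log.append((arm, reward, prob))
    return counts, log


# Two-armed instance with the means and horizon used in the paper's Figure 2.
counts, log = run_kl_ms([0.80, 0.90], 10_000)
```

Because every entry of `log` stores the exact probability of the chosen arm, the logged data can be reused directly by propensity-based offline estimators.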

Paper Structure

This paper contains 66 sections, 31 theorems, 141 equations, 15 figures, 9 tables, 2 algorithms.

Key Result

Theorem 1

For any $K$-armed bandit problem with reward distributions supported on $[0,1]$, KL-MS has regret bounded as follows. For any $\Delta \geq 0$ and $c \in (0, \frac{1}{4}]$:

Figures (15)

  • Figure 1: $\mu = [0.20, 0.25], T = 10,000$
  • Figure 2: $\mu = [0.80, 0.90], T = 10,000$
  • Figure 3: $M = 10^3$
  • Figure 4: $M = 10^4$
  • Figure 5: $M = 10^5$
  • ...and 10 more figures

Theorems & Definitions (34)

  • Theorem 1
  • Theorem 2: Sub-UCB
  • Theorem 3: Adaptive worst-case regret
  • Theorem 4
  • Lemma 5
  • Remark 6
  • Remark 7
  • Lemma 8
  • Lemma 9: Lemma \ref{lemma:expected-sub-optimal-arm-pull-main} restated
  • Lemma 10
  • ...and 24 more