Kullback-Leibler Maillard Sampling for Multi-armed Bandits with Bounded Rewards

Hao Qin; Kwang-Sung Jun; Chicheng Zhang

TL;DR

KL-MS introduces a KL-based Maillard sampling rule for bandits with $[0,1]$-bounded rewards. The rule yields closed-form action probabilities, which are useful for offline policy evaluation. The main contributions are a finite-time regret bound, an adaptive worst-case regret bound of $O\big(\sqrt{\dot{\mu}_1 K T \ln K} + K\ln T\big)$, where $\dot{\mu}_1 = \mu_1(1-\mu_1)$ is the variance term of the optimal arm, a proof that KL-MS satisfies the sub-UCB criterion, and asymptotic optimality in the Bernoulli reward setting. The work positions KL-MS as a practical, off-policy-friendly alternative to Thompson sampling for bounded rewards, with potential extensions to exponential-family rewards and structured bandits.
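To make the variance adaptation concrete, the following is a small sketch that evaluates the two terms of the worst-case bound with all constants omitted. It assumes the variance term equals $\mu^*(1-\mu^*)$, as in the abstract; the function name `adaptive_bound_terms` and its parameters are illustrative and do not come from the paper.

```python
import math


def adaptive_bound_terms(mu_star, n_arms, horizon):
    """Return (leading, additive) terms of sqrt(mu*(1-mu*) K T ln K) + K ln T.

    Constants hidden by the O(.) notation are omitted, so the numbers are
    only meaningful relative to each other, not as actual regret values.
    """
    variance = mu_star * (1.0 - mu_star)  # assumed variance term of the optimal arm
    leading = math.sqrt(variance * n_arms * horizon * math.log(n_arms))
    additive = n_arms * math.log(horizon)
    return leading, additive


if __name__ == "__main__":
    for mu_star in (0.5, 0.9, 0.99):
        leading, additive = adaptive_bound_terms(mu_star, n_arms=10, horizon=10_000)
        print(f"mu*={mu_star:.2f}  leading={leading:.1f}  additive={additive:.1f}")
```

The leading term shrinks as the optimal mean approaches 0 or 1, which is the sense in which the bound adapts to the optimal arm.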

Abstract

We study $K$-armed bandit problems where the reward distributions of the arms are all supported on the $[0,1]$ interval. It has been a challenge to design regret-efficient randomized exploration algorithms in this setting. Maillard sampling \cite{maillard13apprentissage}, an attractive alternative to Thompson sampling, has recently been shown to achieve competitive regret guarantees in the sub-Gaussian reward setting \cite{bian2022maillard} while maintaining closed-form action probabilities, which is useful for offline policy evaluation. In this work, we propose the Kullback-Leibler Maillard Sampling (KL-MS) algorithm, a natural extension of Maillard sampling that achieves a KL-style gap-dependent regret bound. We show that KL-MS is asymptotically optimal when the rewards are Bernoulli and has a worst-case regret bound of the form $O(\sqrt{\mu^*(1-\mu^*) K T \ln K} + K \ln T)$, where $\mu^*$ is the expected reward of the optimal arm and $T$ is the time horizon length.
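The closed-form action probabilities mentioned above can be illustrated with a minimal sketch. It assumes the sampling rule has the Maillard-style form $p_a \propto \exp(-N_a \, \mathrm{kl}(\hat{\mu}_a, \hat{\mu}_{\max}))$, where $N_a$ is the pull count of arm $a$, $\hat{\mu}_a$ its empirical mean, and $\mathrm{kl}$ the binary relative entropy. This summary does not spell out that formula, so treat it as an assumption. The function names, the clipping constant `eps`, the initial round-robin pulls, and the Bernoulli reward simulation are all illustrative choices.

```python
import math
import random


def bernoulli_kl(p, q, eps=1e-12):
    """Binary relative entropy kl(p, q); q is clipped away from 0 and 1."""
    p = min(max(p, 0.0), 1.0)
    q = min(max(q, eps), 1.0 - eps)
    value = 0.0
    if p > 0.0:
        value += p * math.log(p / q)
    if p < 1.0:
        value += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return max(value, 0.0)


def kl_ms_probabilities(counts, means):
    """Assumed rule: p_a proportional to exp(-N_a * kl(mean_a, best mean))."""
    best = max(means)
    weights = [math.exp(-n * bernoulli_kl(m, best)) for n, m in zip(counts, means)]
    total = sum(weights)
    return [w / total for w in weights]


def run_kl_ms(true_means, horizon, seed=0):
    """Simulate the sketch on Bernoulli arms; returns pull counts per arm."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms

    def pull(arm):
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward

    for arm in range(n_arms):  # illustrative initialization: one pull per arm
        pull(arm)
    for _ in range(n_arms, horizon):
        means = [s / n for s, n in zip(sums, counts)]
        probs = kl_ms_probabilities(counts, means)
        arm = rng.choices(range(n_arms), weights=probs)[0]
        pull(arm)
    return counts


if __name__ == "__main__":
    print(kl_ms_probabilities([10, 10], [0.20, 0.25]))
    print(run_kl_ms([0.1, 0.9], horizon=2000))
```

Because each round's probabilities are available in closed form, they can be logged alongside the chosen arm, which is what importance-weighted offline evaluation needs.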

Paper Structure (66 sections, 31 theorems, 141 equations, 15 figures, 9 tables, 2 algorithms)

Introduction
Our contributions.
Preliminaries
Asymptotic optimality in the Bernoulli reward setting
Minimax ratio
Sub-UCB
Related Work
Bandits with bounded rewards.
Randomized exploration for bandits.
Binarization trick.
Bandit algorithms with worst-case regrets that depend on the optimal reward.
Main Result
The KL Maillard Sampling Algorithm.
Main Regret Theorem
Proof Sketch of Theorem 1
...and 51 more sections

Key Result

Theorem 1

For any $K$-arm bandit problem with reward distribution supported on $[0,1]$, KL-MS has regret bounded as follows. For any $\Delta \geq 0$ and $c \in (0, \frac{1}{4}]$:

Figures (15)

Figure 1: $\mu = [0.20, 0.25], T = 10,000$
Figure 2: $\mu = [0.80, 0.90], T = 10,000$
Figure 3: $M = 10^3$
Figure 4: $M = 10^4$
Figure 5: $M = 10^5$
...and 10 more figures

Theorems & Definitions (34)

Theorem 1
Theorem 2: Sub-UCB
Theorem 3: Adaptive worst-case regret
Theorem 4
Lemma 5
Remark 6
Remark 7
Lemma 8
Lemma 9: Lemma \ref{lemma:expected-sub-optimal-arm-pull-main} restated
Lemma 10
...and 24 more
