Near-Optimal Regret for KL-Regularized Multi-Armed Bandits

Kaixuan Ji; Qingyue Zhao; Heyang Zhao; Qiwei Di; Quanquan Gu

TL;DR

This work addresses the statistical efficiency of online learning in KL-regularized MABs via a sharp analysis of KL-UCB using a novel peeling argument, and shows that in the low-regularization regime (large $\eta$) the KL-regularized regret is $\eta$-independent and scales as $\tilde{\Theta}(\sqrt{KT})$.

Abstract

Recent studies have shown that reinforcement learning with KL-regularized objectives can enjoy faster rates of convergence or logarithmic regret, in contrast to the classical $\sqrt{T}$-type regret in the unregularized setting. However, the statistical efficiency of online learning with respect to KL-regularized objectives remains far from completely characterized, even when specialized to multi-armed bandits (MABs). We address this problem for MABs via a sharp analysis of KL-UCB using a novel peeling argument, which yields a $\tilde{O}(\eta K \log^2 T)$ upper bound: the first high-probability regret bound with linear dependence on $K$. Here, $T$ is the time horizon, $K$ is the number of arms, $\eta^{-1}$ is the regularization intensity, and $\tilde{O}$ hides all logarithmic factors except those involving $\log T$. The near-tightness of our analysis is certified by the first non-constant lower bound $\Omega(\eta K \log T)$, which follows from subtle hard-instance constructions and a tailored decomposition of the Bayes prior. Moreover, in the low-regularization regime (i.e., large $\eta$), we show that the KL-regularized regret for MABs is $\eta$-independent and scales as $\tilde{\Theta}(\sqrt{KT})$. Overall, our results provide a thorough understanding of KL-regularized MABs across all regimes of $\eta$ and yield nearly optimal bounds in terms of $K$, $\eta$, and $T$.
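The abstract does not spell out the objective, so the following is only a minimal sketch under an assumed, commonly used formulation: a policy $\pi$ over $K$ arms is scored by its expected reward minus $\eta^{-1}\,\mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}})$, whose maximizer is the Gibbs policy $\pi^*(a) \propto \pi_{\mathrm{ref}}(a)\exp(\eta\, r(a))$. The reference policy `ref_policy`, the function names, and the numeric values are illustrative assumptions, not taken from the paper.

```python
import math

def kl_regularized_value(policy, rewards, ref_policy, eta):
    """Expected reward minus (1/eta) * KL(policy || ref_policy) (assumed objective)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref_policy) if p > 0)
    return expected_reward - kl / eta

def gibbs_policy(rewards, ref_policy, eta):
    """Maximizer of the objective above: pi(a) proportional to ref(a) * exp(eta * r(a))."""
    shift = max(rewards)  # subtract the max reward for numerical stability
    weights = [q * math.exp(eta * (r - shift)) for q, r in zip(ref_policy, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

def optimal_value(rewards, ref_policy, eta):
    """Closed form of the maximum: (1/eta) * log sum_a ref(a) * exp(eta * r(a))."""
    shift = max(rewards)
    s = sum(q * math.exp(eta * (r - shift)) for q, r in zip(ref_policy, rewards))
    return shift + math.log(s) / eta

# Hypothetical three-armed instance with a uniform reference policy.
rewards = [0.2, 0.5, 0.9]
ref_policy = [1 / 3, 1 / 3, 1 / 3]
eta = 4.0
pi_star = gibbs_policy(rewards, ref_policy, eta)
```

Under this formulation, the per-round regret of a played policy would be `optimal_value(...) - kl_regularized_value(policy, ...)`, which is nonnegative. Small $\eta$ (strong regularization) keeps `pi_star` close to `ref_policy`, while large $\eta$ concentrates it on the best arm.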

Paper Structure (31 sections, 13 theorems, 87 equations, 2 figures, 1 table)

Introduction
Notation.
Related Work
Optimism in Multi-armed Bandits.
RL with KL-Regularization.
Problem Setup
Algorithm and Regret Analysis
Algorithm Description
Theoretical Guarantee
Proof Sketch of Theorem \ref{thm:upperbound}
Bounding $I_1$ (Harmonic Sum).
Bounding $I_2$ (Peeling Technique).
Lower Bounds
Proof Overview of Hardness Results
Proof Overview of Theorem \ref{thm:lowerbound-slow}
...and 16 more sections

Key Result

Lemma 4.1

Given $\delta > 0$, let $\mathcal{E}(\delta)$ denote the event that our constructed optimistic reward function is indeed larger than the true mean reward. Then the event $\mathcal{E}(\delta)$ holds with probability at least $1-\delta$.
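The exact form of the optimistic reward function is not included in this summary. As a hedged illustration of the kind of event the lemma describes, the sketch below uses a generic Hoeffding-style bonus for rewards in $[0, 1]$, union-bounded over arms and pull counts. The bonus formula, the uniform arm-pulling rule, and all function names are assumptions made for illustration, not the paper's construction.

```python
import math
import random

def optimism_bonus(n_pulls, num_arms, horizon, delta):
    """Hoeffding-style width for [0, 1] rewards, union-bounded over arms and pull counts."""
    return math.sqrt(math.log(2 * num_arms * horizon / delta) / (2 * n_pulls))

def optimistic_reward(reward_sum, n_pulls, num_arms, horizon, delta):
    """Empirical mean plus bonus, clipped at 1; an unpulled arm gets the trivial bound 1."""
    if n_pulls == 0:
        return 1.0
    bonus = optimism_bonus(n_pulls, num_arms, horizon, delta)
    return min(1.0, reward_sum / n_pulls + bonus)

def count_violations(true_means, horizon, delta, seed=0):
    """Pull arms uniformly at random with Bernoulli rewards and count the rounds
    in which the optimistic estimate falls below the true mean for some arm."""
    rng = random.Random(seed)
    k = len(true_means)
    sums, counts, violations = [0.0] * k, [0] * k, 0
    for _ in range(horizon):
        arm = rng.randrange(k)
        sums[arm] += 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        if any(
            optimistic_reward(sums[a], counts[a], k, horizon, delta) < true_means[a]
            for a in range(k)
        ):
            violations += 1
    return violations
```

A call such as `count_violations([0.3, 0.5, 0.8], horizon=200, delta=0.05)` reports how often optimism fails along one simulated run. Under the assumed bonus, Hoeffding's inequality with a union bound implies that a run has no violations with probability at least $1-\delta$.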

Figures (2)

Figure 1: The near-comprehensive picture of KL-regularized MABs rendered in this paper. All logarithmic factors except $\log T$ are omitted to avoid clutter.
Figure 2: The shared uniform Bayes prior for every $t \geq \eta^2 K$. The plot above takes $1$ out of $K$ axes of $\mathsf{Unif}([-\alpha, +\alpha]^K)$ for illustration. The gray boxes denote the density of $\mathbf{x}$ and hence the red boxes represent the density of $\mathbf{x} + {\bm{\mu}}\delta_t$.
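Figure 1 is not reproduced in this summary. As a rough numerical companion, the sketch below evaluates the two rates quoted in the abstract, $\eta K \log^2 T$ and $\sqrt{KT}$, with all constants and hidden logarithmic factors dropped, and reports which one is smaller for a given $\eta$. The function names and the example values are illustrative assumptions, and the regime boundaries drawn in the paper may differ from this simple comparison.

```python
import math

def regularized_rate(eta, num_arms, horizon):
    """eta * K * log^2(T): the upper-bound rate from the abstract, constants dropped."""
    return eta * num_arms * math.log(horizon) ** 2

def sqrt_rate(num_arms, horizon):
    """sqrt(K * T): the eta-independent rate for large eta, constants dropped."""
    return math.sqrt(num_arms * horizon)

def smaller_rate(eta, num_arms, horizon):
    """Return the label and value of whichever of the two rates is smaller."""
    reg = regularized_rate(eta, num_arms, horizon)
    unreg = sqrt_rate(num_arms, horizon)
    if reg <= unreg:
        return ("eta*K*log^2(T)", reg)
    return ("sqrt(K*T)", unreg)

def crossover_eta(num_arms, horizon):
    """Value of eta at which the two rates coincide: sqrt(T / K) / log^2(T)."""
    return math.sqrt(horizon / num_arms) / math.log(horizon) ** 2
```

For example, with `num_arms=10` and `horizon=10**6`, `smaller_rate(0.1, 10, 10**6)` selects the $\eta$-dependent rate, while `smaller_rate(100, 10, 10**6)` selects $\sqrt{KT}$.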

Theorems & Definitions (16)

Lemma 4.1
Theorem 4.2
Remark 4.3
Theorem 5.1: Low-regularization regime
Corollary 5.2
Theorem 5.3: High-regularization regime
Remark 5.4
Remark 6.1
Lemma A.1 (Lemma A.1 of zhao2025logarithmic)
Lemma A.2
...and 6 more
