Near-Optimal Regret for KL-Regularized Multi-Armed Bandits

Kaixuan Ji, Qingyue Zhao, Heyang Zhao, Qiwei Di, Quanquan Gu

TL;DR

This work characterizes the statistical efficiency of online learning in KL-regularized MABs through a sharp analysis of KL-UCB based on a novel peeling argument, and shows that in the low-regularization regime (large $\eta$) the KL-regularized regret is $\eta$-independent and scales as $\tilde{\Theta}(\sqrt{KT})$.

Abstract

Recent studies have shown that reinforcement learning with KL-regularized objectives can enjoy faster rates of convergence or logarithmic regret, in contrast to the classical $\sqrt{T}$-type regret in the unregularized setting. However, the statistical efficiency of online learning with respect to KL-regularized objectives remains far from completely characterized, even when specialized to multi-armed bandits (MABs). We address this problem for MABs via a sharp analysis of KL-UCB using a novel peeling argument, which yields a $\tilde{O}(\eta K \log^2 T)$ upper bound: the first high-probability regret bound with linear dependence on $K$. Here, $T$ is the time horizon, $K$ is the number of arms, $\eta^{-1}$ is the regularization intensity, and $\tilde{O}$ hides all logarithmic factors except those involving $\log T$. The near-tightness of our analysis is certified by the first non-constant lower bound $\Omega(\eta K \log T)$, which follows from subtle hard-instance constructions and a tailored decomposition of the Bayes prior. Moreover, in the low-regularization regime (i.e., large $\eta$), we show that the KL-regularized regret for MABs is $\eta$-independent and scales as $\tilde{\Theta}(\sqrt{KT})$. Overall, our results provide a thorough understanding of KL-regularized MABs across all regimes of $\eta$ and yield nearly optimal bounds in terms of $K$, $\eta$, and $T$.
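To make the quantity being bounded concrete, the following is a minimal sketch of a KL-regularized bandit objective and its regret. It assumes the common formulation $J(\pi) = \mathbb{E}_{a \sim \pi}[r(a)] - \eta^{-1}\,\mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}})$ with a fixed reference policy $\pi_{\mathrm{ref}}$; the reference policy and all function names here are illustrative assumptions rather than the paper's notation. Under this formulation the maximizer is the Gibbs policy $\pi^*(a) \propto \pi_{\mathrm{ref}}(a)\exp(\eta\, r(a))$, and the regret sums the objective gaps of the played policies.

```python
import math

def gibbs_policy(rewards, ref_policy, eta):
    """Maximizer of E_pi[r] - (1/eta) * KL(pi || ref): pi(a) proportional to ref(a) * exp(eta * r(a)).

    Assumes every entry of ref_policy is strictly positive.
    """
    logits = [math.log(q) + eta * r for r, q in zip(rewards, ref_policy)]
    shift = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - shift) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def kl_objective(policy, rewards, ref_policy, eta):
    """KL-regularized value: expected reward minus (1/eta) * KL(policy || ref)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref_policy) if p > 0)
    return expected_reward - kl / eta

def kl_regret(policies, rewards, ref_policy, eta):
    """Sum over rounds of the gap between the optimal value and each played policy's value."""
    best = kl_objective(gibbs_policy(rewards, ref_policy, eta), rewards, ref_policy, eta)
    return sum(best - kl_objective(pi, rewards, ref_policy, eta) for pi in policies)
```

As $\eta$ grows, the penalty term shrinks and the Gibbs policy concentrates on the best arm, which matches the low-regularization regime described in the abstract; for small $\eta$ the optimal policy stays close to the reference policy.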

Paper Structure (31 sections, 13 theorems, 87 equations, 2 figures, 1 table)


Key Result

Lemma 4.1

Given $\delta > 0$, let $\mathcal{E}(\delta)$ denote the event that our constructed optimistic reward function is indeed larger than the true reward mean. Then the event $\mathcal{E}(\delta)$ holds with probability at least $1-\delta$.
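The exact optimistic reward function used in the lemma is not reproduced above, so the sketch below substitutes a standard Hoeffding-style bonus for rewards in $[0, 1]$ purely as a stand-in. It illustrates what an optimism event of this kind looks like: the upper confidence value stays at or above the true mean at every step, with failure probability at most $\delta$ after a union bound over arms and sample counts. The bonus formula, the round-robin pulling rule, and all names are assumptions for illustration, not the paper's construction.

```python
import math
import random

def hoeffding_bonus(n_pulls, num_arms, horizon, delta):
    """Confidence width from one-sided Hoeffding, union-bounded over arms and sample counts."""
    return math.sqrt(math.log(num_arms * horizon / delta) / (2 * n_pulls))

def optimism_holds(means, horizon, delta, rng):
    """Simulate round-robin Bernoulli pulls; return True if UCB >= true mean at every step."""
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(horizon):
        arm = t % k  # hypothetical pulling rule, chosen only to keep the sketch simple
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        # Only the pulled arm's estimate changes, so only its UCB needs rechecking.
        ucb = sums[arm] / counts[arm] + hoeffding_bonus(counts[arm], k, horizon, delta)
        if ucb < means[arm]:
            return False
    return True

def failure_rate(means, horizon, delta, runs, seed=0):
    """Empirical fraction of simulated runs in which optimism is violated at least once."""
    rng = random.Random(seed)
    failures = sum(not optimism_holds(means, horizon, delta, rng) for _ in range(runs))
    return failures / runs
```

Calling `failure_rate([0.3, 0.5, 0.7], horizon=300, delta=0.1, runs=200)` gives an empirical check that violations occur in at most a $\delta$ fraction of runs; the union bound is conservative, so the observed rate is typically far below $\delta$.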

Figures (2)

  • Figure 1: The near-comprehensive picture of KL-regularized MABs rendered in this paper. All logarithmic factors except $\log T$ are omitted to avoid clutter.
  • Figure 2: The shared uniform Bayes prior for every $t \geq \eta^2 K$. The plot above takes $1$ out of $K$ axes of $\mathsf{Unif}([-\alpha, +\alpha]^K)$ for illustration. The gray boxes denote the density of $\mathbf{x}$ and hence the red boxes represent the density of $\mathbf{x} + {\bm{\mu}}\delta_t$.

Theorems & Definitions (16)

  • Lemma 4.1
  • Theorem 4.2
  • Remark 4.3
  • Theorem 5.1: Low-regularization regime
  • Corollary 5.2
  • Theorem 5.3: High-regularization regime
  • Remark 5.4
  • Remark 6.1
  • Lemma A.1 (Lemma A.1 of zhao2025logarithmic)
  • Lemma A.2
  • ...and 6 more