UCB algorithms for multi-armed bandits: Precise regret and adaptive inference

Qiyang Han, Koulik Khamaru, Cun-Hui Zhang

TL;DR

The paper derives a precise regret characterization for UCB1 in multi-armed bandits and develops adaptive inference for mean rewards when data are collected sequentially. By introducing a deterministic fixed-point surrogate $n_{a;T}^*$ for the number of arm pulls and a noiseless continuous-time comparison, the authors obtain a non-asymptotic regret formula $Reg_T^*(\Theta)$ that tightly tracks the actual regret $R(\Theta)$, and they delineate the regimes in which the classical Lai-Robbins bound is informative. They also formulate a stability-based framework that yields quantitative central limit theorems for empirical means and Ridge estimators, which gives valid confidence sets under adaptive data collection and extends to bandits with structured means. Together, these results advance both the rigorous performance evaluation of UCB-like algorithms and practical adaptive inference for sequential experiments, with simulations corroborating the theory. The methodology links precise regret, fixed-point analysis, and adaptive inference in a single view, and it may apply more broadly to sequential decision-making problems with UCB-index algorithms.

Abstract

Upper Confidence Bound (UCB) algorithms are a widely used class of sequential algorithms for the $K$-armed bandit problem. Despite extensive research over the past decades aimed at understanding their asymptotic and (near) minimax optimality properties, a precise understanding of their regret behavior remains elusive. This gap has not only hindered the evaluation of their actual algorithmic efficiency, but also limited further developments in statistical inference in sequential data collection. This paper bridges these two fundamental aspects--precise regret analysis and adaptive statistical inference--through a deterministic characterization of the number of arm pulls for a UCB index algorithm [Lai87, Agr95, ACBF02]. Our resulting precise regret formula not only accurately captures the actual behavior of the UCB algorithm for finite time horizons and individual problem instances, but also provides significant new insights into the regimes in which the existing theory remains informative. In particular, we show that the classical Lai-Robbins regret formula is exact if and only if the sub-optimality gaps exceed the order $\sigma\sqrt{K\log T/T}$. We also show that its maximal regret deviates from the minimax regret by a logarithmic factor, thereby settling its strict minimax optimality in the negative. The deterministic characterization of the number of arm pulls for the UCB algorithm also has major implications for adaptive statistical inference. Building on the seminal work of [Lai82], we show that the UCB algorithm satisfies certain stability properties that lead to quantitative central limit theorems in two settings, including the empirical means of unknown rewards in the bandit setting. These results have an important practical implication: conventional confidence sets designed for i.i.d. data remain valid even when data are collected sequentially.
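
To make the objects in the abstract concrete, the following is a minimal simulation sketch, not the authors' implementation. It assumes Gaussian rewards with known noise level `sigma` and an exploration bonus of $\sigma\sqrt{2\log T/n_a}$; the paper's exact index may differ. The function names `run_ucb`, `lai_robbins_term`, and `gap_threshold` are hypothetical. The sketch records the number of pulls per arm and the resulting pseudo-regret, and computes two reference quantities: the standard Gaussian Lai-Robbins leading term $\sum_{a:\Delta_a>0} 2\sigma^2\log T/\Delta_a$ and the gap order $\sigma\sqrt{K\log T/T}$ quoted above.

```python
import numpy as np

def run_ucb(means, horizon, sigma=1.0, seed=0):
    """Simulate a UCB index policy; return pull counts and pseudo-regret."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    k = means.size
    counts = np.zeros(k, dtype=int)
    sums = np.zeros(k)
    for t in range(horizon):
        if t < k:
            arm = t  # pull every arm once so that all counts are positive
        else:
            # Assumed index: empirical mean + sigma * sqrt(2 log T / n_a).
            bonus = sigma * np.sqrt(2.0 * np.log(horizon) / counts)
            arm = int(np.argmax(sums / counts + bonus))
        counts[arm] += 1
        sums[arm] += rng.normal(means[arm], sigma)
    gaps = means.max() - means
    return counts, float(gaps @ counts)

def lai_robbins_term(means, horizon, sigma=1.0):
    """Gaussian Lai-Robbins leading term: sum of 2 sigma^2 log T / gap."""
    means = np.asarray(means, dtype=float)
    gaps = means.max() - means
    positive = gaps[gaps > 0]
    return float(np.sum(2.0 * sigma**2 * np.log(horizon) / positive))

def gap_threshold(k, horizon, sigma=1.0):
    """Order sigma * sqrt(K log T / T) from the abstract (constant omitted)."""
    return float(sigma * np.sqrt(k * np.log(horizon) / horizon))

if __name__ == "__main__":
    T = 5000
    for gap in (0.5, 0.02):  # one gap above and one below the threshold order
        counts, regret = run_ucb([gap, 0.0], T)
        print(gap, counts, regret, lai_robbins_term([gap, 0.0], T),
              gap_threshold(2, T))
```

Running the script with one gap well above and one gap below `gap_threshold(2, T)` gives a quick feel for when the Lai-Robbins term is a reasonable proxy for the simulated regret; a single seed is only illustrative, and averaging over seeds would be needed for any quantitative comparison.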

Paper Structure

This paper contains 40 sections, 19 theorems, 147 equations, 2 figures, 1 algorithm.

Key Result

Theorem 3.2

Suppose the Gaussian noise assumption (assump:gaussian_noise) holds. There exists a universal constant $C>0$ such that if $\mathrm{err}(\Theta)\leq 1/C$, then [display equation missing]. Here $\vartheta_T^\ast\equiv \gamma_T^{-2}\cdot T e^{-\gamma_T^2/2}$.

Figures (2)

  • Figure 1: Left panel: Regrets for various sub-optimality gaps $\Delta \in [0.01,0.25]$. Right panel: Number of arm pulls for various sub-optimality gaps $\Delta \in [0.01,0.05]$.
  • Figure 2: Left panel: Gaussian approximation of the empirical mean for the optimal arm. Right panel: Coverage probabilities for the CIs in (def:CI_mean) for both arms.
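
The kind of coverage experiment described for Figure 2 can be sketched as follows. This is a hedged illustration, not the paper's code: the interval used here is the textbook known-variance interval $\hat\mu_a \pm z\,\sigma/\sqrt{n_a}$, which may differ from the paper's definition (def:CI_mean), the exploration bonus $\sigma\sqrt{2\log T/n_a}$ is an assumption, and the names `ucb_arm_stats` and `wald_coverage` are hypothetical.

```python
import numpy as np

def ucb_arm_stats(means, horizon, sigma, rng):
    """Run one UCB trajectory; return per-arm empirical means and pull counts."""
    means = np.asarray(means, dtype=float)
    k = means.size
    counts = np.zeros(k)
    sums = np.zeros(k)
    for t in range(horizon):
        if t < k:
            arm = t  # initial round-robin pull of each arm
        else:
            # Assumed index: empirical mean + sigma * sqrt(2 log T / n_a).
            index = sums / counts + sigma * np.sqrt(2.0 * np.log(horizon) / counts)
            arm = int(np.argmax(index))
        counts[arm] += 1
        sums[arm] += rng.normal(means[arm], sigma)
    return sums / counts, counts

def wald_coverage(means, horizon=1000, sigma=1.0, reps=200, z=1.96, seed=0):
    """Fraction of replications in which the i.i.d.-style interval covers each mean."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    hits = np.zeros(means.size)
    for _ in range(reps):
        est, counts = ucb_arm_stats(means, horizon, sigma, rng)
        half_width = z * sigma / np.sqrt(counts)  # width uses the random pull count
        hits += np.abs(est - means) <= half_width
    return hits / reps

if __name__ == "__main__":
    print(wald_coverage([0.5, 0.0]))  # per-arm empirical coverage
```

The interval width depends on the random pull count $n_a$, which is exactly what makes the data adaptive; the stability results summarized above are what would justify treating $n_a$ as if it were deterministic.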

Theorems & Definitions (38)

  • Definition 3.1
  • Remark 1
  • Theorem 3.2
  • Proposition 3.3
  • Remark 2
  • Corollary 3.4
  • Corollary 3.5
  • Theorem 3.6
  • Proposition 3.7
  • Remark 3
  • ...and 28 more