On Lai's Upper Confidence Bound in Multi-Armed Bandits
Huachen Ren, Cun-Hui Zhang
TL;DR
The paper provides sharp nonasymptotic regret bounds for two Lai-type UCB policies in Gaussian K-armed bandits: (i) a UCB index with a fixed exploration level $b_{T'}$, whose regret bound has a leading constant matching the Lai-Robbins lower bound, and (ii) Lai's UCB index, whose exploration function $g(x)$ is linked to $\log x$. Both bounds give explicit control of second-order terms via Brownian-boundary analyses. The analysis treats the index as a sequence of repeated significance tests and uses boundary-crossing probabilities of random walks, together with nonlinear renewal theory, to bound the number of suboptimal pulls. The results connect nonasymptotic guarantees with classical information-theoretic limits, showing that carefully tuned exploration can attain the optimal leading constant even at finite horizons. The work also discusses extensions to sub-Gaussian rewards and potential applications to broader sequential decision-making settings, including contextual bandits and reinforcement learning.
Abstract
In this memorial paper, we honor Tze Leung Lai's seminal contributions to the topic of multi-armed bandits, with a specific focus on his pioneering work on the upper confidence bound. We establish sharp non-asymptotic regret bounds for an upper confidence bound index with a constant level of exploration for Gaussian rewards. Furthermore, we establish a non-asymptotic regret bound for the upper confidence bound index of Lai (1987) which employs an exploration function that decreases with the sample size of the corresponding arm. The regret bounds have leading constants that match the Lai-Robbins lower bound. Our results highlight an aspect of Lai's seminal works that deserves more attention in the machine learning literature.
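To make the setting concrete, the following is a minimal sketch of a UCB policy with a constant exploration level on Gaussian arms. It is an illustration under stated assumptions, not the paper's index: the function names, the choice of level log(horizon), and the bonus form sigma * sqrt(2 b / n) are all assumed here, and the paper's tuned level and Lai's exploration function may differ. The helper lai_robbins_constant computes the standard Gaussian form of the Lai-Robbins leading constant, the sum of 2 sigma^2 / Delta_a over suboptimal arms, which multiplies log T in the lower bound.

```python
import math
import random


def ucb_fixed_level(means, horizon, level=None, sigma=1.0, seed=0):
    """Simulate a UCB policy with a constant exploration level (hypothetical sketch).

    Returns (pull counts per arm, cumulative pseudo-regret).
    Requires horizon >= number of arms.
    """
    rng = random.Random(seed)
    k = len(means)
    # Assumption: level b = log(horizon); the paper's tuned level may differ.
    b = math.log(horizon) if level is None else level
    counts = [0] * k
    sums = [0.0] * k
    best = max(means)
    regret = 0.0

    def index(a):
        # Sample mean plus an exploration bonus that shrinks as the arm is pulled.
        return sums[a] / counts[a] + sigma * math.sqrt(2.0 * b / counts[a])

    for t in range(horizon):
        # Pull each arm once first so every index is well defined.
        arm = t if t < k else max(range(k), key=index)
        reward = rng.gauss(means[arm], sigma)
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return counts, regret


def lai_robbins_constant(means, sigma=1.0):
    """Leading constant of the Lai-Robbins lower bound for Gaussian arms."""
    best = max(means)
    return sum(2.0 * sigma ** 2 / (best - m) for m in means if m < best)
```

For example, calling ucb_fixed_level([0.0, 0.5, 1.0], 2000) returns the pull counts and the pseudo-regret of one simulated run, which can be compared against lai_robbins_constant([0.0, 0.5, 1.0]) * math.log(2000) as a rough benchmark. Such a comparison is only indicative, since the lower bound is asymptotic and a single run is noisy.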
