A Refined Analysis of UCBVI
Simone Drago, Marco Mussi, Alberto Maria Metelli
TL;DR
This work refines the analysis of UCBVI for finite-horizon tabular RL, deriving tighter constants for both the Chernoff-Hoeffding and the Bernstein-Freedman exploration bonuses. The authors show that the same $\widetilde{\mathcal{O}}(\sqrt{HSAT})$ regret rate holds with substantially smaller constants, and that the improved constants translate into meaningful empirical gains over the original UCBVI and the MVP algorithm. The theoretical results are complemented by numerical validation in illustrative environments and on the RiverSwim benchmark, where the BF-I variant often yields the lowest regret. The refined UCBVI therefore remains a practical and competitive choice for finite-horizon tabular RL.
Abstract
In this work, we provide a refined analysis of the UCBVI algorithm (Azar et al., 2017), improving both the bonus terms and the regret analysis. Additionally, we compare our version of UCBVI with both its original version and the state-of-the-art MVP algorithm. Our empirical validation demonstrates that improving the multiplicative constants in the bounds has significant positive effects on the empirical performance of the algorithms.
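To make concrete where the multiplicative constant of a bonus enters the algorithm, here is a minimal sketch of UCBVI-style optimistic backward induction with a Hoeffding-style bonus. It is an illustration under stated assumptions, not the procedure or constants analyzed in the paper: the function names, the constant `c`, the `log_term` argument, and the clipping of values at $H - h$ are all hypothetical choices, and rewards are assumed known and bounded in $[0, 1]$.

```python
import math


def hoeffding_bonus(n, horizon, log_term, c=1.0):
    """Hoeffding-style bonus c * H * sqrt(log_term / n).

    `c` is a placeholder constant (an assumption), not a value from the paper.
    """
    return c * horizon * math.sqrt(log_term / max(n, 1))


def optimistic_value_iteration(counts, trans_counts, rewards, horizon,
                               log_term, c=1.0):
    """Backward induction on the empirical model with an exploration bonus.

    counts[s][a]           -- number of visits to (s, a)
    trans_counts[s][a][s2] -- number of observed transitions (s, a) -> s2
    rewards[s][a]          -- known reward in [0, 1] (assumption)
    Returns (Q, V) with Q[h][s][a] and V[h][s]; V[horizon] is all zeros.
    """
    n_states = len(counts)
    n_actions = len(counts[0])
    V = [[0.0] * n_states for _ in range(horizon + 1)]
    Q = [[[0.0] * n_actions for _ in range(n_states)] for _ in range(horizon)]

    for h in range(horizon - 1, -1, -1):
        # With rewards in [0, 1], no return from stage h can exceed H - h.
        max_value = float(horizon - h)
        for s in range(n_states):
            for a in range(n_actions):
                n = counts[s][a]
                if n == 0:
                    # Unvisited pairs receive the most optimistic value.
                    Q[h][s][a] = max_value
                    continue
                # Expected next-stage value under the empirical transitions.
                expected_next = sum(
                    trans_counts[s][a][s2] / n * V[h + 1][s2]
                    for s2 in range(n_states)
                )
                bonus = hoeffding_bonus(n, horizon, log_term, c)
                Q[h][s][a] = min(max_value,
                                 rewards[s][a] + expected_next + bonus)
            V[h][s] = max(Q[h][s])
    return Q, V
```

In this sketch, a smaller `c` shrinks every bonus by the same factor, so the optimistic values approach the empirical estimates after fewer visits. This illustrates why the size of the constant can affect empirical regret even when the asymptotic rate is unchanged.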
