A Refined Analysis of UCBVI

Simone Drago, Marco Mussi, Alberto Maria Metelli

TL;DR

This work refines the analysis of UCBVI for finite-horizon, tabular RL by deriving tighter constants for both Chernoff-Hoeffding and Bernstein-Freedman exploration bonuses. The authors show that the same $\widetilde{\mathcal{O}}(\sqrt{HSAT})$ regret rate can be achieved with substantially smaller constants, and they demonstrate that the improved constants translate into meaningful empirical gains relative to the original UCBVI and the MVP algorithm. Theoretical results are complemented by numerical validation in illustrative environments and the RiverSwim benchmark, where the BF-I variant often yields the lowest regret. By preserving the same asymptotic rate while reducing constants, the refined UCBVI remains a practical and competitive choice for finite-horizon tabular RL.

Abstract

In this work, we provide a refined analysis of the UCBVI algorithm (Azar et al., 2017), improving both the bonus terms and the regret analysis. Additionally, we compare our version of UCBVI with both its original version and the state-of-the-art MVP algorithm. Our empirical validation demonstrates that improving the multiplicative constants in the bounds has significant positive effects on the empirical performance of the algorithms.

Paper Structure

This paper contains 20 sections, 15 theorems, 160 equations, 2 figures, 2 tables, 1 algorithm.

Key Result

Theorem 4.1

Let $\delta \in (0,1)$. Considering: then, w.p. at least $1-\delta$, the regret of UCBVI-CH is bounded by: where $L = \ln{(5HSAT / \delta)}$. For $T \ge \Omega ( H^2 S^3 A )$, this bound translates to $\widetilde{\mathcal{O}}(H \sqrt{SAT})$.
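To get a feel for the quantities in Theorem 4.1, the sketch below evaluates the logarithmic term $L = \ln(5HSAT/\delta)$ and the scale of the leading-order rate $H\sqrt{SAT}$. It is a minimal illustration, not the paper's bound: the multiplicative constants and lower-order terms of the theorem are not reproduced here, and the function names are hypothetical. In the usage lines, $S=5$ and $H=10$ match the RiverSwim setting of Figure 2, while $A=2$, $T=10^5$, and $\delta=0.05$ are assumed values chosen only for the example.

```python
import math


def log_term(H: int, S: int, A: int, T: int, delta: float) -> float:
    """Return L = ln(5 * H * S * A * T / delta), the log term of Theorem 4.1."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return math.log(5 * H * S * A * T / delta)


def ch_rate_scale(H: int, S: int, A: int, T: int) -> float:
    """Return H * sqrt(S * A * T): the scale of the rate only.

    Constants and logarithmic factors are deliberately dropped, so this
    is NOT the regret bound itself.
    """
    return H * math.sqrt(S * A * T)


def threshold_scale(H: int, S: int, A: int) -> int:
    """Return H^2 * S^3 * A, the scale of the condition on T.

    The Omega(.) in the theorem hides a constant that is not modeled here.
    """
    return H**2 * S**3 * A


# Assumed example values (A, T, and delta are not taken from the paper).
H, S, A, T, delta = 10, 5, 2, 10**5, 0.05
print(f"L = {log_term(H, S, A, T, delta):.3f}")
print(f"H * sqrt(SAT) = {ch_rate_scale(H, S, A, T):.1f}")
print(f"H^2 S^3 A = {threshold_scale(H, S, A)}")
```

Because $L$ grows only logarithmically in $T$, the $\widetilde{\mathcal{O}}$ notation absorbs it, which is why the empirical differences between variants come down to the multiplicative constants.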

Figures (2)

  • Figure 1: Cumulative regret in toy environments with $S=3$ states and $A=3$ actions ($10$ runs, mean $\pm$ $95\%$ C.I.).
  • Figure 2: Cumulative regret in the RiverSwim environment with $S=5$ states and horizon $H=10$ ($4$ runs, mean $\pm$ $95\%$ C.I.).

Theorems & Definitions (15)

  • Theorem 4.1: Regret for UCBVI with Chernoff-Hoeffding bonus
  • Theorem 4.2: Regret for UCBVI with Bernstein-Freedman bonus
  • Lemma 1: Bernstein inequality for Bernoulli random variables
  • Lemma 2: Regret decomposition upper bound
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Lemma 6
  • Lemma 7
  • Lemma 8: Summation over typical episodes of state-action wise model errors
  • ...and 5 more