Worst-Case Regret Bounds for Exploration via Randomized Value Functions
Daniel Russo
TL;DR
This work analyzes exploration in reinforcement learning via randomized value functions, focusing on Randomized Least Squares Value Iteration (RLSVI) applied to tabular finite-horizon MDPs. In each episode, RLSVI injects Gaussian noise into the empirical rewards, solves the resulting perturbed MDP, and acts greedily with respect to the perturbed value function. The analysis combines a sequence of concentration lemmas with an optimism argument. The main contribution is a worst-case regret bound of $\tilde{O}( H^{3} S^{3/2} \sqrt{A K} )$ for RLSVI with $\beta_k = \tfrac{1}{2}SH^3 \log(2HSAk)$, where $H$ is the horizon, $S$ the number of states, $A$ the number of actions, and $K$ the number of episodes. The bound is polynomial in $H$, $S$, and $A$ and sublinear in $K$. The result gives a theoretical guarantee, in tabular settings, for an exploration strategy that remains compatible with practical value-function learning, unlike simple $\epsilon$-greedy or Boltzmann exploration, for which no such guarantee is given here.
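The per-episode planning step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's reference implementation: the function name `rlsvi_episode_policy` and the count-based data layout are hypothetical, and the noise variance $\beta_k/(n+1)$ and the $n+1$ denominators in the empirical estimates are assumed conventions for the tabular case.

```python
import math
import random


def rlsvi_episode_policy(counts, reward_sums, next_counts, H, S, A, k, rng):
    """Return a greedy policy for episode k from a noise-perturbed empirical MDP.

    counts[h][s][a]          -- visits n to (h, s, a) so far
    reward_sums[h][s][a]     -- sum of rewards observed at (h, s, a)
    next_counts[h][s][a][s2] -- observed transitions (h, s, a) -> s2
    (All names and the layout are hypothetical, chosen for this sketch.)
    """
    # Noise scale from the TL;DR: beta_k = (1/2) S H^3 log(2 H S A k).
    beta_k = 0.5 * S * H**3 * math.log(2 * H * S * A * k)

    v_next = [0.0] * S  # value after the final period is zero
    policy = [[0] * S for _ in range(H)]

    # Backward induction on the perturbed empirical MDP.
    for h in reversed(range(H)):
        v = [0.0] * S
        for s in range(S):
            best_q, best_a = -math.inf, 0
            for a in range(A):
                n = counts[h][s][a]
                # Assumption: estimates are normalized by n + 1.
                r_hat = reward_sums[h][s][a] / (n + 1)
                ev = sum(
                    next_counts[h][s][a][s2] * v_next[s2] for s2 in range(S)
                ) / (n + 1)
                # Assumption: Gaussian reward noise with variance beta_k / (n + 1).
                w = rng.gauss(0.0, math.sqrt(beta_k / (n + 1)))
                q = r_hat + ev + w
                if q > best_q:
                    best_q, best_a = q, a
            v[s] = best_q
            policy[h][s] = best_a
        v_next = v
    return policy
```

With no data, every count is zero and the policy is driven entirely by the sampled noise, which is what produces exploration early on. As counts grow, the noise standard deviation shrinks like $1/\sqrt{n+1}$ and the policy approaches the greedy policy of the empirical MDP.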
Abstract
This paper studies a recent proposal to use randomized value functions to drive exploration in reinforcement learning. These randomized value functions are generated by injecting random noise into the training data, making the approach compatible with many popular methods for estimating parameterized value functions. By providing a worst-case regret bound for tabular finite-horizon Markov decision processes, we show that planning with respect to these randomized value functions can induce provably efficient exploration.
