Worst-Case Regret Bounds for Exploration via Randomized Value Functions

Daniel Russo

TL;DR

This work analyzes exploration in reinforcement learning via randomized value functions, focusing on Randomized Least Squares Value Iteration (RLSVI) applied to tabular finite-horizon MDPs. By injecting Gaussian noise into the empirical rewards and solving the resulting perturbed MDP, RLSVI induces exploration that can be analyzed with a sequence of concentration-based lemmas and an optimism argument. The main contribution is a worst-case regret bound of $\tilde{O}( H^{3} S^{3/2} \sqrt{A K} )$ over $K$ episodes for RLSVI with $\beta_k = \tfrac{1}{2}SH^3 \log(2HSAk)$, which is polynomial in the horizon $H$, the number of states $S$, and the number of actions $A$. This advances principled exploration that remains compatible with practical value-function learning, providing theoretical guarantees that simple $\epsilon$-greedy or Boltzmann methods lack in tabular settings.

Abstract

This paper studies a recent proposal to use randomized value functions to drive exploration in reinforcement learning. These randomized value functions are generated by injecting random noise into the training data, making the approach compatible with many popular methods for estimating parameterized value functions. By providing a worst-case regret bound for tabular finite-horizon Markov decision processes, we show that planning with respect to these randomized value functions can induce provably efficient exploration.
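
The planning step described above (perturb the empirical rewards with Gaussian noise, then solve the perturbed MDP by backward induction) can be sketched as follows. This is a minimal illustration rather than the paper's exact algorithm: the function name `rlsvi_plan`, the array layout, the noise variance `beta / (n + 1)`, and the `+ 1` regularization in the empirical estimates are all assumptions of this sketch.

```python
import numpy as np

def rlsvi_plan(counts, reward_sums, trans_counts, beta, rng):
    """One planning pass on a noise-perturbed empirical MDP (hypothetical helper).

    counts:       (H, S, A) visit counts n(h, s, a)
    reward_sums:  (H, S, A) sum of rewards observed at (h, s, a)
    trans_counts: (H, S, A, S) counts of observed next states
    beta:         noise scale for this episode (assumed given)
    Returns the perturbed Q-values (H, S, A) and a greedy policy (H, S).
    """
    H, S, A = counts.shape
    denom = counts + 1.0                      # assumed "+1" regularization
    r_hat = reward_sums / denom               # empirical mean rewards
    p_hat = trans_counts / denom[..., None]   # empirical transition estimates
    # Gaussian reward perturbation; variance shrinks as (h, s, a) is visited more.
    noise = rng.normal(0.0, np.sqrt(beta / denom))

    q = np.zeros((H + 1, S, A))               # q[H] = 0 is the terminal value
    for h in range(H - 1, -1, -1):            # backward induction over the horizon
        v_next = q[h + 1].max(axis=1)         # greedy value at the next step
        q[h] = r_hat[h] + noise[h] + p_hat[h] @ v_next
    return q[:H], q[:H].argmax(axis=2)
```

In an episode loop one would call `rlsvi_plan(..., rng=np.random.default_rng(0))`, follow the returned greedy policy for `H` steps, and update the three count arrays with the observed transitions before the next call.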

Paper Structure

This paper contains 21 sections, 20 theorems, 65 equations, 1 algorithm.

Key Result

Theorem 1

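Let $\mathcal{M}$ denote the set of MDPs with horizon $H$, $S$ states, $A$ actions, and rewards bounded in $[0,1]$. Then for a tuning parameter sequence $\beta=\{\beta_{k}\}_{k\in \mathbb{N}}$ with $\beta_k = \frac{1}{2}SH^3 \log(2HSAk)$, RLSVI attains worst-case regret $\tilde{O}\left( H^{3} S^{3/2} \sqrt{A K} \right)$ over $K$ episodes, uniformly over $\mathcal{M}$.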
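
To get a feel for the quantities involved, the tuning schedule $\beta_k$ and the leading term of the rate $H^{3} S^{3/2} \sqrt{AK}$ can be evaluated numerically. The sketch below is illustrative only: the function names are hypothetical, and `regret_rate` drops the constants and logarithmic factors hidden by the $\tilde{O}$ notation, so it shows scaling rather than an actual bound.

```python
import math

def beta_k(k, H, S, A):
    """Tuning parameter beta_k = (1/2) * S * H^3 * log(2 * H * S * A * k)."""
    return 0.5 * S * H**3 * math.log(2 * H * S * A * k)

def regret_rate(K, H, S, A):
    """Leading term H^3 * S^(3/2) * sqrt(A * K), without constants or log factors."""
    return H**3 * S**1.5 * math.sqrt(A * K)

if __name__ == "__main__":
    H, S, A = 10, 5, 3  # arbitrary example sizes, not taken from the paper
    for k in (1, 100, 10_000):
        print(f"k={k:>6}  beta_k={beta_k(k, H, S, A):.1f}")
    # Quadrupling the number of episodes doubles the leading term.
    print(regret_rate(4000, H, S, A) / regret_rate(1000, H, S, A))
```

Because $\beta_k$ grows only logarithmically in $k$, the injected noise scale changes slowly across episodes, while the $\sqrt{K}$ dependence means the leading term of the per-episode regret shrinks as $K$ grows.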

Theorems & Definitions (31)

  • Theorem 1
  • Remark 1
  • Lemma 1: Validity of confidence sets
  • Lemma 2
  • proof
  • Lemma 3
  • Lemma 4
  • proof
  • proof: Proof of the optimism lemma
  • Lemma 4
  • ...and 21 more