The Power of Resets in Online Reinforcement Learning

Zakaria Mhammedi; Dylan J. Foster; Alexander Rakhlin

TL;DR

This work investigates online reinforcement learning with local simulator access (local planning), a protocol in which the agent can reset to previously observed states and sample from their dynamics during training. It proves new statistical guarantees in high-dimensional settings under low coverability and $Q^{\star}$-realizability via the SimGolf algorithm, and as a consequence shows that Exogenous Block MDPs (ExBMDPs) can be learned tractably. It then provides a computationally efficient alternative, RVFS, that achieves polynomial sample complexity under pushforward coverability and stronger realizability assumptions. RVFS avoids the limitations of global optimism by using core-sets and a recursive value-function search, and applies to ExBMDPs, including cases with weakly correlated exogenous noise, through $\texttt{RVFS}^{\text{exo}}$ and randomized rounding. Together, the results clarify when and how local simulator access yields fundamental speedups in online RL with nonlinear function approximation, and connect planning with simulators to learning with powerful function classes. The findings are relevant to domains where accurate simulators are available but end-to-end online RL with neural networks has been challenging due to sample complexity and distribution shift.

Abstract

Simulators are a pervasive tool in reinforcement learning, but most existing algorithms cannot efficiently exploit simulator access -- particularly in high-dimensional domains that require general function approximation. We explore the power of simulators through online reinforcement learning with local simulator access (or, local planning), an RL protocol where the agent is allowed to reset to previously observed states and follow their dynamics during training. We use local simulator access to unlock new statistical guarantees that were previously out of reach:

- We show that MDPs with low coverability (Xie et al. 2023) -- a general structural condition that subsumes Block MDPs and Low-Rank MDPs -- can be learned in a sample-efficient fashion with only $Q^{\star}$-realizability (realizability of the optimal state-action value function); existing online RL algorithms require significantly stronger representation conditions.
- As a consequence, we show that the notorious Exogenous Block MDP problem (Efroni et al. 2022) is tractable under local simulator access.

The results above are achieved through a computationally inefficient algorithm. We complement them with a more computationally efficient algorithm, RVFS (Recursive Value Function Search), which achieves provable sample complexity guarantees under a strengthened statistical assumption known as pushforward coverability. RVFS can be viewed as a principled, provable counterpart to a successful empirical paradigm that combines recursive search (e.g., MCTS) with value function approximation.
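
The local simulator protocol described in the abstract can be illustrated with a small sketch. This is not the paper's code: the class name `LocalSimulator`, the tabular transition format, and the bookkeeping of visited (layer, state) pairs are all assumptions made for illustration. The only property it is meant to show is that resets are allowed solely to states the agent has already observed.

```python
import random


class LocalSimulator:
    """Toy tabular MDP wrapper (hypothetical) enforcing local-simulator resets.

    The agent may reset only to the initial state or to a (layer, state)
    pair it has already observed; any other reset request is rejected.
    """

    def __init__(self, transitions, rewards, initial_state, horizon, seed=0):
        # transitions[(state, action)] -> list of (next_state, probability)
        self.transitions = transitions
        # rewards[(state, action)] -> float; missing entries count as 0.0
        self.rewards = rewards
        self.initial_state = initial_state
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.visited = {(0, initial_state)}  # (layer, state) pairs seen so far
        self.layer = 0
        self.state = initial_state
        self.num_samples = 0  # number of simulator queries issued

    def reset(self, layer=0, state=None):
        """Jump back to a previously observed state at the given layer."""
        state = self.initial_state if state is None else state
        if (layer, state) not in self.visited:
            raise ValueError("local simulator: state not yet observed")
        self.layer, self.state = layer, state
        return state

    def step(self, action):
        """Sample one transition from the current state."""
        if self.layer >= self.horizon:
            raise RuntimeError("episode has ended; call reset()")
        next_states, probs = zip(*self.transitions[(self.state, action)])
        next_state = self.rng.choices(next_states, weights=probs, k=1)[0]
        reward = self.rewards.get((self.state, action), 0.0)
        self.layer += 1
        self.state = next_state
        self.visited.add((self.layer, next_state))
        self.num_samples += 1
        return next_state, reward


# Usage on a two-state deterministic chain (made-up example data).
transitions = {
    ("s0", "a"): [("s1", 1.0)],
    ("s0", "b"): [("s0", 1.0)],
    ("s1", "a"): [("s1", 1.0)],
    ("s1", "b"): [("s0", 1.0)],
}
rewards = {("s1", "a"): 1.0}

sim = LocalSimulator(transitions, rewards, initial_state="s0", horizon=3)
sim.step("a")                   # observes s1 at layer 1
sim.reset(layer=1, state="s1")  # allowed: s1 was observed at layer 1
sim.step("b")                   # branch from s1 with a different action
```

In contrast, a standard online RL environment would expose only `reset()` to the initial state, and a global (generative) simulator would accept any state; the `visited` check is what makes the access local.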

Paper Structure (108 sections, 39 theorems, 245 equations, 8 algorithms)

Introduction
Contributions
Sample-efficient learning
Practical, computationally efficient learning
Paper organization
Setup: Reinforcement Learning with Local Simulator Access
Online Reinforcement Learning with Local Simulator Access
Executable versus non-executable policies
Implications for planning
Additional Notation
New Sample-Efficient Learning Guarantees via Local Simulators
Function approximation setup and coverability
Coverability
Algorithm
Main Result
...and 93 more sections

Key Result

Theorem 2

Let $\varepsilon, \delta \in(0,1)$ be given and suppose the $Q^\star$-realizability assumption and the coverability assumption hold with $C_{\texttt{cov}}>0$. Then the policy $\widehat{\pi}$ produced by $\texttt{SimGolf}(\mathcal{Q}, C_{\texttt{cov}},\varepsilon, \delta)$ h
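
For reference, the two assumptions named in the theorem are commonly stated as follows. This is a sketch in standard notation and not a quotation from the paper: the symbols $H$ (horizon), $\mathcal{X}$ and $\mathcal{A}$ (state and action spaces), $d_h^{\pi}$ (layer-$h$ occupancy measure of policy $\pi$), and $\mu_h$ (a reference distribution) are assumptions here, and the coverability coefficient follows the definition of Xie et al. (2023).

$$
Q_h^{\star} \in \mathcal{Q}_h \quad \text{for all } h \in [H]
\qquad \text{($Q^\star$-realizability)}
$$

$$
C_{\texttt{cov}} \;=\; \inf_{\mu_1, \dots, \mu_H \in \Delta(\mathcal{X} \times \mathcal{A})} \; \sup_{\pi,\, h \in [H]} \; \left\| \frac{d_h^{\pi}}{\mu_h} \right\|_{\infty}
\qquad \text{(coverability coefficient)}
$$

Intuitively, $C_{\texttt{cov}}$ is small when a single distribution per layer covers the state-action pairs reachable by every policy.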

Theorems & Definitions (43)

Definition 1: Non-executable policy
Remark 1: Squared Bellman error versus average Bellman error
Theorem 2: Main guarantee for SimGolf
Lemma 1: Efroni et al. (2022)
Corollary 1: SimGolf for ExBMDPs
Remark 2
Theorem 3: Main guarantee for RVFS
Lemma 2: Efroni et al. (2022)
Theorem 4: Main guarantee of $\texttt{RVFS}^{\texttt{exo}}$ for ExBMDPs
Lemma 3
...and 33 more
