The Power of Resets in Online Reinforcement Learning

Zakaria Mhammedi, Dylan J. Foster, Alexander Rakhlin

TL;DR

This work investigates online reinforcement learning with local simulator access (local planning), a protocol in which the agent can reset to previously observed states and sample transitions from them. Its first algorithm, SimGolf, yields new statistical guarantees in high-dimensional settings under low coverability and $Q^{\star}$-realizability, and implies that Exogenous Block MDPs (ExBMDPs) are tractable under local simulator access. Because SimGolf is computationally inefficient, the paper then gives a more computationally efficient alternative, RVFS, which achieves polynomial sample complexity under pushforward coverability and stronger realizability assumptions. RVFS avoids the limitations of global optimism by using core-sets and a recursive value-function search, and it applies to ExBMDPs, including cases with weakly correlated exogenous noise, through $\texttt{RVFS}^{\text{exo}}$ and randomized rounding. Together, the results clarify when local simulator access yields sample-efficiency gains in online RL with nonlinear function approximation, and they connect planning with simulators to learning with expressive function classes. The findings are relevant to domains where accurate simulators are available but end-to-end online RL with neural networks has been difficult because of sample complexity and distribution shift.

Abstract

Simulators are a pervasive tool in reinforcement learning, but most existing algorithms cannot efficiently exploit simulator access -- particularly in high-dimensional domains that require general function approximation. We explore the power of simulators through online reinforcement learning with local simulator access (or, local planning), an RL protocol where the agent is allowed to reset to previously observed states and follow their dynamics during training. We use local simulator access to unlock new statistical guarantees that were previously out of reach:

  • We show that MDPs with low coverability (Xie et al. 2023) -- a general structural condition that subsumes Block MDPs and Low-Rank MDPs -- can be learned in a sample-efficient fashion with only $Q^{\star}$-realizability (realizability of the optimal state-action value function); existing online RL algorithms require significantly stronger representation conditions.
  • As a consequence, we show that the notorious Exogenous Block MDP problem (Efroni et al. 2022) is tractable under local simulator access.

The results above are achieved through a computationally inefficient algorithm. We complement them with a more computationally efficient algorithm, RVFS (Recursive Value Function Search), which achieves provable sample complexity guarantees under a strengthened statistical assumption known as pushforward coverability. RVFS can be viewed as a principled, provable counterpart to a successful empirical paradigm that combines recursive search (e.g., MCTS) with value function approximation.
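
The local simulator protocol described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy `ChainMDP`, the `LocalSimulator` wrapper, and the `sample` method are hypothetical names, not the paper's interface. The sketch shows only the access rule: the agent may draw a transition from any (layer, state) pair it has already observed, and from nowhere else.

```python
from dataclasses import dataclass


@dataclass
class ChainMDP:
    """Hypothetical toy MDP: states 0..n_states-1 on a line, deterministic moves."""
    n_states: int = 5
    horizon: int = 4

    def step(self, state, action):
        # Action 1 moves one state to the right, action 0 stays in place.
        nxt = min(state + action, self.n_states - 1)
        # Reward 1 only when the transition lands in the last state.
        reward = 1.0 if nxt == self.n_states - 1 else 0.0
        return nxt, reward


class LocalSimulator:
    """Illustrative wrapper that allows resets only to previously observed states."""

    def __init__(self, mdp, initial_state=0):
        self.mdp = mdp
        # The initial state at layer 0 is always available to the agent.
        self.visited = {(0, initial_state)}
        self.n_queries = 0

    def sample(self, h, state, action):
        """Reset to (h, state) and draw one transition, if that pair was seen before."""
        if (h, state) not in self.visited:
            raise ValueError(f"state {state} at layer {h} has not been observed")
        if h >= self.mdp.horizon:
            raise ValueError("layer index is past the horizon")
        nxt, reward = self.mdp.step(state, action)
        # The newly observed next state becomes a legal reset target.
        self.visited.add((h + 1, nxt))
        self.n_queries += 1
        return nxt, reward


sim = LocalSimulator(ChainMDP())
first = sim.sample(0, 0, 1)    # start from the initial state
second = sim.sample(1, 1, 1)   # reset to the state just observed
again = sim.sample(0, 0, 0)    # resetting back to an earlier state is also allowed
```

A global (generative) simulator would accept any `(h, state)` query; the check against `visited` is what makes the access local.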

Paper Structure

This paper contains 108 sections, 39 theorems, 245 equations, 8 algorithms.

Key Result

Theorem 2

Let $\varepsilon, \delta \in(0,1)$ be given and suppose the $Q^\star$-realizability and coverability assumptions hold with $C_{\texttt{cov}}>0$. Then the policy $\widehat{\pi}$ produced by $\texttt{SimGolf}(\mathcal{Q}, C_{\texttt{cov}},\varepsilon, \delta)$ …
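
For orientation, the two assumptions named in Theorem 2 are commonly written as below. This is a sketch in the notation of Xie et al. (2023), not a quotation of this paper: the symbols $H$ (horizon), $\Pi$ (policy class), $d_h^{\pi}$ (layer-$h$ occupancy measure of $\pi$), $\mu_h$ (a distribution over state-action pairs), and $\mathcal{Q}_h$ (the layer-$h$ component of $\mathcal{Q}$) are assumptions introduced here.

$$
Q_h^{\star} \in \mathcal{Q}_h \quad \text{for all } h \in [H],
\qquad
C_{\texttt{cov}} := \inf_{\mu_1, \ldots, \mu_H \in \Delta(\mathcal{X} \times \mathcal{A})} \; \sup_{\pi \in \Pi,\, h \in [H]} \left\| \frac{d_h^{\pi}}{\mu_h} \right\|_{\infty}.
$$

The first condition says only that the optimal value function lies in the class; the second asks for a single distribution per layer that covers every policy's occupancy up to a factor $C_{\texttt{cov}}$.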

Theorems & Definitions (43)

  • Definition 1: Non-executable policy
  • Remark 1: Squared Bellman error versus average Bellman error
  • Theorem 2: Main guarantee for SimGolf
  • Lemma 1: Efroni et al. (2022)
  • Corollary 1: SimGolf for ExBMDPs
  • Remark 2
  • Theorem 3: Main guarantee for RVFS
  • Lemma 2: Efroni et al. (2022)
  • Theorem 4: Main guarantee of $\texttt{RVFS}^{\texttt{exo}}$ for ExBMDPs
  • Lemma 3
  • ...and 33 more