Scalable Online Exploration via Coverability

Philip Amortila, Dylan J. Foster, Akshay Krishnamurthy

TL;DR

This work tackles exploration in high-dimensional RL by proposing policy-coverage objectives, centered on the $L_1$-Coverage objective whose optimal value defines $L_1$-Coverability. It develops computationally efficient planning relaxations via $L_{\infty}$-coverability and pushforward relaxations, and introduces CODEX for reward-free model-based exploration and CODEX.W for model-free exploration. Theoretical guarantees show that bounded $L_1$-Coverability yields sample-efficient exploration and enables reliable downstream policy optimization, including offline-to-online transitions via the DEC framework. Empirically, the approach improves state-space exploration on MountainCar relative to baselines, and the framework unifies exploration with standard RL pipelines, offering scalable, end-to-end exploration guarantees with nonlinear function approximation.

Abstract

Exploration is a major challenge in reinforcement learning, especially for high-dimensional domains that require function approximation. We propose exploration objectives -- policy optimization objectives that enable downstream maximization of any reward function -- as a conceptual framework to systematize the study of exploration. Within this framework, we introduce a new objective, $L_1$-Coverage, which generalizes previous exploration schemes and supports three fundamental desiderata: 1. Intrinsic complexity control. $L_1$-Coverage is associated with a structural parameter, $L_1$-Coverability, which reflects the intrinsic statistical difficulty of the underlying MDP, subsuming Block and Low-Rank MDPs. 2. Efficient planning. For a known MDP, optimizing $L_1$-Coverage efficiently reduces to standard policy optimization, allowing flexible integration with off-the-shelf methods such as policy gradient and Q-learning approaches. 3. Efficient exploration. $L_1$-Coverage enables the first computationally efficient model-based and model-free algorithms for online (reward-free or reward-driven) reinforcement learning in MDPs with low coverability. Empirically, we find that $L_1$-Coverage effectively drives off-the-shelf policy optimization algorithms to explore the state space.

Paper Structure (107 sections, 78 theorems, 301 equations, 4 figures, 7 algorithms)

Key Result

Proposition 1

For any distribution $p\in\Delta(\Pi_{\mathsf{rns}})$, we have that for all functions $g:\mathcal{X}\times\mathcal{A}\to[0,B]$, all $\pi\in\Pi$, and all $\varepsilon>0$, [...] This result is meaningful in the parameter regime where $\Psi_{h,\varepsilon}^{M}(p)<1/\varepsilon$. We refer to this regime as [...]
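To illustrate why $\Psi_{h,\varepsilon}^{M}(p)<1/\varepsilon$ is the meaningful regime, the following is a minimal tabular sketch. It assumes the $L_1$-Coverage objective has the form $\Psi_{h,\varepsilon}^{M}(p)=\sup_{\pi}\mathbb{E}^{M,\pi}\big[d_h^{\pi}(x,a)/(d_h^{p}(x,a)+\varepsilon\, d_h^{\pi}(x,a))\big]$; this form is not stated in the text above and is an assumption of the sketch. Under it, the ratio inside the expectation never exceeds $1/\varepsilon$, so a value of exactly $1/\varepsilon$ carries no information. The supremum is restricted to a small hypothetical policy set, and the occupancies and function names are illustrative only.

```python
from typing import Dict, Sequence, Tuple

# Occupancy at a fixed layer h: (state, action) -> probability (hypothetical tabular setting).
Occupancy = Dict[Tuple[int, int], float]


def mixture_occupancy(occupancies: Sequence[Occupancy], weights: Sequence[float]) -> Occupancy:
    """Occupancy d^p of the mixture p that plays policy i with probability weights[i]."""
    mixed: Occupancy = {}
    for occ, w in zip(occupancies, weights):
        for key, prob in occ.items():
            mixed[key] = mixed.get(key, 0.0) + w * prob
    return mixed


def l1_coverage(occupancies: Sequence[Occupancy], weights: Sequence[float], eps: float) -> float:
    """Assumed objective: max over the given policies of E_{d^pi}[d^pi / (d^p + eps * d^pi)]."""
    d_p = mixture_occupancy(occupancies, weights)
    best = 0.0
    for d_pi in occupancies:
        value = sum(
            prob * prob / (d_p.get(key, 0.0) + eps * prob)
            for key, prob in d_pi.items()
            if prob > 0.0
        )
        best = max(best, value)
    return best


# Three hypothetical policies, each reaching a different state-action pair with probability 1.
policies = [{(0, 0): 1.0}, {(1, 0): 1.0}, {(2, 0): 1.0}]
eps = 0.1

psi_single = l1_coverage(policies, [1.0, 0.0, 0.0], eps)   # cover ignores two of the policies
psi_uniform = l1_coverage(policies, [1 / 3, 1 / 3, 1 / 3], eps)  # cover spreads mass evenly
print(psi_single, psi_uniform)
```

In this toy instance, a cover that puts all its mass on one policy attains the trivial value $1/\varepsilon$, while the uniform mixture falls strictly below it, which is the regime the proposition is concerned with.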

Figures (4)

  • Figure 1: Number of unique discrete states visited (mean/standard error over 10 runs) and occupancy heatmaps for each policy cover obtained by $L_1$-Coverage (Algorithm \ref{alg:linf_relaxation}), MaxEnt, and uniform exploration. Each epoch comprises a single policy update in Algorithm \ref{alg:linf_relaxation} and MaxEnt, obtained through 1000 steps of REINFORCE with rollouts of length 400. Heatmap legend: velocity (x-axis), position (y-axis), start state ($\bullet$), goal state with $0$ velocity ($\star$).
  • Figure 2: Entropy and $L_1$-Coverability measured on each policy cover obtained from $L_1$-Coverage (Algorithm \ref{alg:linf_relaxation}), MaxEnt, and uniform exploration on the MountainCar environment. We plot the mean and standard error across 10 runs. Each epoch corresponds to a single policy update in Algorithm \ref{alg:linf_relaxation} and MaxEnt, obtained through 1000 steps of REINFORCE with rollouts of length 400.
  • Figure 3: Number of discrete states visited (mean and standard error over 10 runs) and occupancy heatmaps for each policy cover obtained from $L_1$-Coverage (Algorithm \ref{alg:linf_relaxation}), MaxEnt, and uniform exploration in the Pendulum environment. Each epoch comprises a single policy update in Algorithm \ref{alg:linf_relaxation} and MaxEnt, obtained through 1000 steps of REINFORCE with rollouts of length 400. Heatmap axes: torque (x-axis) and angle (y-axis). Start state indicated by $\bullet$.
  • Figure 4: Entropy and $L_1$-Coverability measured on each policy cover obtained from $L_1$-Coverage (Algorithm \ref{alg:linf_relaxation}), MaxEnt, and uniform exploration on the Pendulum environment. We plot the mean and standard error across 10 runs. Each epoch corresponds to a single policy update in Algorithm \ref{alg:linf_relaxation} and MaxEnt, obtained through 1000 steps of REINFORCE with rollouts of length 400.
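The captions above report two empirical measures for each policy cover: the number of discrete states visited and the entropy of the visitation distribution. A minimal sketch of how such measures could be computed from sampled continuous states is given below. The uniform grid, the bin count, the state bounds, and the function names are assumptions made for illustration; the source does not specify the discretization used in the experiments.

```python
import math
from collections import Counter
from typing import Iterable, Sequence, Tuple


def discretize(state: Sequence[float], lows: Sequence[float],
               highs: Sequence[float], bins: int) -> Tuple[int, ...]:
    """Map a continuous state to bin indices on a uniform grid (hypothetical discretization)."""
    idx = []
    for s, lo, hi in zip(state, lows, highs):
        frac = (s - lo) / (hi - lo)
        # Clamp so that boundary and out-of-range values land in a valid bin.
        idx.append(min(bins - 1, max(0, int(frac * bins))))
    return tuple(idx)


def coverage_metrics(states: Iterable[Sequence[float]], lows: Sequence[float],
                     highs: Sequence[float], bins: int) -> Tuple[int, float]:
    """Return (number of distinct bins visited, entropy in nats of the empirical bin distribution)."""
    counts = Counter(discretize(s, lows, highs, bins) for s in states)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return len(counts), entropy


# Illustrative (position, velocity) samples; the bounds below are assumed for this example.
lows, highs = (-1.2, -0.07), (0.6, 0.07)
samples = [(-1.2, -0.07), (0.6, 0.07), (-0.3, 0.0), (-0.3, 0.0)]
n_visited, entropy = coverage_metrics(samples, lows, highs, bins=10)
print(n_visited, entropy)
```

Both quantities depend on the grid resolution, so comparisons between exploration methods are only meaningful when the same discretization is used for all of them.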

Theorems & Definitions (91)

  • Definition 2.1: Exploration objective
  • Remark 1
  • Proposition 1: Change of measure for
  • Proposition 2
  • Proposition 3
  • Theorem 1
  • Remark 2
  • Proposition 4
  • Theorem 2
  • Theorem 3: Guarantee for under $L_\infty$-Coverability
  • ...and 81 more