Scalable Online Exploration via Coverability

Philip Amortila; Dylan J. Foster; Akshay Krishnamurthy

TL;DR

This work tackles exploration in high-dimensional RL by proposing policy-coverage objectives, centered on the $L_1$-Coverage objective, whose optimal value defines $L_1$-Coverability. It develops computationally efficient planning relaxations via $L_{\infty}$-Coverability and pushforward coverability, and introduces CODEX for reward-free model-based exploration and CODEX.W for model-free exploration. Theoretical guarantees show that bounded $L_1$-Coverability yields sample-efficient exploration and enables reliable downstream policy optimization, including offline-to-online transitions via the DEC framework. Empirically, the approach improves state-space exploration on MountainCar relative to the MaxEnt and uniform-exploration baselines. The framework unifies exploration with standard RL pipelines, offering scalable, end-to-end exploration guarantees with nonlinear function approximation.

Abstract

Exploration is a major challenge in reinforcement learning, especially for high-dimensional domains that require function approximation. We propose exploration objectives -- policy optimization objectives that enable downstream maximization of any reward function -- as a conceptual framework to systematize the study of exploration. Within this framework, we introduce a new objective, $L_1$-Coverage, which generalizes previous exploration schemes and supports three fundamental desiderata: 1. Intrinsic complexity control. $L_1$-Coverage is associated with a structural parameter, $L_1$-Coverability, which reflects the intrinsic statistical difficulty of the underlying MDP, subsuming Block and Low-Rank MDPs. 2. Efficient planning. For a known MDP, optimizing $L_1$-Coverage efficiently reduces to standard policy optimization, allowing flexible integration with off-the-shelf methods such as policy gradient and Q-learning approaches. 3. Efficient exploration. $L_1$-Coverage enables the first computationally efficient model-based and model-free algorithms for online (reward-free or reward-driven) reinforcement learning in MDPs with low coverability. Empirically, we find that $L_1$-Coverage effectively drives off-the-shelf policy optimization algorithms to explore the state space.
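
To make the objective concrete, here is a minimal tabular sketch. It assumes a form that is not spelled out in the text above: that the $L_1$-Coverage value of a mixture $p$ at a fixed layer is $\sup_{\pi}\sum_{x,a} d^{\pi}(x,a)^2 / (d^{p}(x,a) + \varepsilon\, d^{\pi}(x,a))$, where $d^{\pi}$ is the state-action occupancy of $\pi$ and $d^{p}$ is the $p$-weighted mixture of occupancies. The function name `l1_coverage`, the finite policy class, and the occupancy vectors are hypothetical and chosen only for illustration.

```python
def l1_coverage(occupancies, weights, eps):
    """Assumed L1-Coverage value of a mixture over a finite policy class.

    occupancies: list of occupancy vectors, one per policy; each vector
                 lists d^pi(x, a) over all state-action pairs and sums to 1.
    weights:     mixture weights p over the same policies (sum to 1).
    eps:         the smoothing parameter epsilon > 0.
    """
    num_sa = len(occupancies[0])
    # Mixture occupancy d^p(x, a) = sum_pi p(pi) * d^pi(x, a).
    d_mix = [
        sum(w * d[i] for w, d in zip(weights, occupancies))
        for i in range(num_sa)
    ]
    best = 0.0
    for d_pi in occupancies:
        value = 0.0
        for d, m in zip(d_pi, d_mix):
            if d > 0.0:  # pairs the policy never visits contribute 0
                value += d * d / (m + eps * d)
        best = max(best, value)  # sup over the (finite) policy class
    return best


# Two hypothetical policies that visit disjoint state-action pairs.
occ = [[1.0, 0.0], [0.0, 1.0]]
uniform_value = l1_coverage(occ, [0.5, 0.5], eps=0.1)
one_sided_value = l1_coverage(occ, [1.0, 0.0], eps=0.1)
```

Under this assumed form, the uniform mixture covers both policies and gives a value of $1/0.6$, while a mixture that ignores the second policy hits the ceiling $1/\varepsilon = 10$, since each term is at most $d^{\pi}(x,a)/\varepsilon$.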

Paper Structure (107 sections, 78 theorems, 301 equations, 4 figures, 7 algorithms)

Introduction
Contributions
Paper organization
Online Reinforcement Learning and Exploration Objectives
Additional notation
Exploration Objectives
The $L_1$-Coverage Objective
$L_1$-Coverage objective
$L_1$-Coverage enables downstream policy optimization
$L_1$-Coverability provides intrinsic complexity control
Optimizing $L_1$-Coverage: Efficient Planning
The $L_{\infty}$-Coverability Relaxation
The algorithm
Examples
The Pushforward Coverability Relaxation
...and 92 more sections

Key Result

Proposition 1

For any distribution $p\in\Delta(\Pi_{\mathsf{rns}})$, we have that for all functions $g:\mathcal{X}\times\mathcal{A}\to[0,B]$, all $\pi\in\Pi$, and all $\varepsilon>0$, [...]. This result is meaningful in the parameter regime where $\Psi_{h,\varepsilon}^{M}(p)<1/\varepsilon$. We refer to this regime as

Figures (4)

Figure 1: Number of unique discrete states visited (mean and standard error over 10 runs) and occupancy heatmaps for each policy cover obtained from $L_1$-Coverage (the $L_{\infty}$-Coverability relaxation algorithm), MaxEnt, and uniform exploration. Each epoch comprises a single policy update in the $L_{\infty}$-Coverability relaxation algorithm and MaxEnt, obtained through 1000 steps of REINFORCE with rollouts of length 400. Heatmap axes: velocity (x-axis) and position (y-axis). Start state indicated by $\bullet$; goal state with $0$ velocity indicated by $\star$.
Figure 2: Entropy and $L_1$-Coverability measured on each policy cover obtained from $L_1$-Coverage (the $L_{\infty}$-Coverability relaxation algorithm), MaxEnt, and uniform exploration on the MountainCar environment. We plot the mean and standard error across 10 runs. Each epoch corresponds to a single policy update in the $L_{\infty}$-Coverability relaxation algorithm and MaxEnt, obtained through 1000 steps of REINFORCE with rollouts of length 400.
Figure 3: Number of discrete states visited (mean and standard error over 10 runs) and occupancy heatmaps for each policy cover obtained from $L_1$-Coverage (the $L_{\infty}$-Coverability relaxation algorithm), MaxEnt, and uniform exploration in the Pendulum environment. Each epoch comprises a single policy update in the $L_{\infty}$-Coverability relaxation algorithm and MaxEnt, obtained through 1000 steps of REINFORCE with rollouts of length 400. Heatmap axes: torque (x-axis) and angle (y-axis). Start state indicated by $\bullet$.
Figure 4: Entropy and $L_1$-Coverability measured on each policy cover obtained from $L_1$-Coverage (the $L_{\infty}$-Coverability relaxation algorithm), MaxEnt, and uniform exploration on the Pendulum environment. We plot the mean and standard error across 10 runs. Each epoch corresponds to a single policy update in the $L_{\infty}$-Coverability relaxation algorithm and MaxEnt, obtained through 1000 steps of REINFORCE with rollouts of length 400.
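
The captions above report two summary statistics for a policy cover: the number of unique discrete states visited and the entropy of the visitation distribution. The sketch below shows one way such statistics could be computed from logged continuous states. It is an illustration under stated assumptions, not the authors' evaluation code: the helper names, the 10-by-10 grid, and the state bounds (position in $[-1.2, 0.6]$, velocity in $[-0.07, 0.07]$, as in common MountainCar implementations) are all hypothetical choices.

```python
import math
from collections import Counter


def discretize(state, low, high, bins):
    """Map one continuous state to a tuple of grid-cell indices."""
    cell = []
    for s, lo, hi, b in zip(state, low, high, bins):
        idx = math.floor((s - lo) / (hi - lo) * b)
        cell.append(min(max(idx, 0), b - 1))  # clip to the valid range
    return tuple(cell)


def coverage_metrics(states, low, high, bins):
    """Return (unique cells visited, entropy in nats of the visit distribution)."""
    counts = Counter(discretize(s, low, high, bins) for s in states)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return len(counts), entropy


# Assumed bounds and grid resolution (hypothetical, for illustration only).
LOW, HIGH, BINS = (-1.2, -0.07), (0.6, 0.07), (10, 10)

# Four logged (position, velocity) states falling into two distinct cells.
logged = [(-1.2, -0.07), (0.6, 0.07), (-1.2, -0.07), (0.6, 0.07)]
num_cells, visit_entropy = coverage_metrics(logged, LOW, HIGH, BINS)
```

With the four logged states above, two cells are visited equally often, so the entropy equals $\log 2$. A finer grid would make both statistics more sensitive to small differences between exploration methods.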

Theorems & Definitions (91)

Definition 2.1: Exploration objective
Remark 1
Proposition 1: Change of measure for $L_1$-Coverage
Proposition 2
Proposition 3
Theorem 1
Remark 2
Proposition 4
Theorem 2
Theorem 3: Guarantee for under $L_\infty$-Coverability
...and 81 more
