NeoRL: Efficient Exploration for Nonepisodic RL

Bhavya Sukhija; Lenart Treven; Florian Dörfler; Stelian Coros; Andreas Krause

TL;DR

NeoRL addresses nonepisodic reinforcement learning for unknown nonlinear dynamics learned from a single trajectory. It introduces a model-based optimistic approach that plans with epistemic uncertainty and uses RKHS/GP dynamics with a horizon scheduling rule to achieve sublinear regret, proving $R_T \le C(\mathbf{x}_0, K, \gamma) \Gamma_T \sqrt{T}$ with high probability. The method leverages calibrated uncertainty models, MPC-based planning, and a theoretical framework ensuring ergodicity and stability, while empirical results show convergence to the optimal average cost across diverse, high-dimensional environments with limited interactions. This work advances the practical and theoretical foundations of nonepisodic deep RL by enabling stable exploration without resets and providing concrete regret guarantees for nonlinear systems.

Abstract

We study the problem of nonepisodic reinforcement learning (RL) for nonlinear dynamical systems, where the system dynamics are unknown and the RL agent has to learn from a single trajectory, i.e., without resets. We propose Nonepisodic Optimistic RL (NeoRL), an approach based on the principle of optimism in the face of uncertainty. NeoRL uses well-calibrated probabilistic models and plans optimistically w.r.t. the epistemic uncertainty about the unknown dynamics. Under continuity and bounded energy assumptions on the system, we provide a first-of-its-kind regret bound of $O(\Gamma_T \sqrt{T})$ for general nonlinear systems with Gaussian process dynamics. We compare NeoRL to other baselines on several deep RL environments and empirically demonstrate that NeoRL achieves the optimal average cost while incurring the least regret.
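To make the regret claim concrete, the following is a minimal sketch of how cumulative regret could be computed from a logged cost trajectory. It assumes the standard average-cost definition, in which each step contributes the gap between the incurred cost and the optimal average cost; the paper's exact definition may differ (for example, by taking an expectation). The function name `cumulative_regret` and the synthetic cost trace are hypothetical and are not taken from the paper.

```python
import numpy as np


def cumulative_regret(costs, optimal_average_cost):
    """Running sum of (incurred cost - optimal average cost).

    Assumed definition, for illustration only.
    """
    costs = np.asarray(costs, dtype=float)
    return np.cumsum(costs - optimal_average_cost)


# Hypothetical cost trace: the per-step gap to the optimum decays like 1/sqrt(t+1).
T = 10_000
t = np.arange(T)
optimal_average_cost = 0.0
costs = optimal_average_cost + 1.0 / np.sqrt(t + 1.0)

regret = cumulative_regret(costs, optimal_average_cost)

# Sublinear regret means the time-averaged regret shrinks as T grows.
average_regret = regret / (t + 1.0)
```

In this synthetic trace the cumulative regret grows on the order of $\sqrt{T}$, so `average_regret` decreases toward zero, which is the qualitative behaviour a bound of the form $O(\Gamma_T \sqrt{T})$ implies whenever $\Gamma_T$ grows more slowly than $\sqrt{T}$.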

Paper Structure (27 sections, 19 theorems, 107 equations, 2 figures, 3 tables, 2 algorithms)

Introduction
Contributions
Problem Setting
Task
Assumptions
NeoRL
Picking the horizon $H_n$
Theoretical Results
Proof sketch
Practical Modifications
Experiments
Baselines
Convergence to the optimal average cost
Calling reset when needed
Related Work
...and 12 more sections

Key Result

Lemma 2.5

Assume ${\bm{f}}^*$ is uniformly continuous and for all $\bm{\pi} \in \Pi$, ${\bm{x}} \in {\mathcal{X}}$, $\left\| \bm{\pi}({\bm{x}}) \right\| \leq u_{\max}$. Further assume there exists $\bm{\pi}_s \in \Pi$ such that we have constants $K, C_u, C_l$ with $C_u > C_l$, $\gamma \in (0, 1)$, $\kappa$, … where ${\bm{x}}_+ = {\bm{f}}^*({\bm{x}}, \bm{\pi}({\bm{x}})) + {\bm{w}}$. Then, $V$ also satisfies …

Figures (2)

Figure 1: Average reward $A(\bm{\pi})$ and cumulative regret $R_T$ over ten different seeds for all environments. We report the mean performance with one standard error as shaded regions. During all experiments, the environment is never reset. For all baselines, we model the dynamics with probabilistic ensembles, except in the Pendulum-GP experiment, where GPs are used instead. NeoRL significantly outperforms all baselines and converges to the optimal average reward, $A(\bm{\pi}^*) = 0$, showing sublinear cumulative regret $R_T$ for all environments.
Figure 2: Total number of resets and cumulative regret $R_T$ for the cart pole balancing task over ten different seeds. We report the mean performance with one standard error as the shaded region. The environment is automatically reset whenever the agent drops the pole. All baselines solve the task, but NeoRL converges the fastest, requiring fewer resets and incurring smaller regret.

Theorems & Definitions (39)

Definition 2.3: ${\mathcal{K}}_{\infty}$-functions
Lemma 2.5
Theorem 2.6: Existence of Average Cost Solution
Definition 2.7: Well-calibrated statistical model of ${\bm{f}}^*$, rothfuss2023hallucinated
Lemma 2.9: Well calibrated confidence intervals for RKHS, rothfuss2023hallucinated
Theorem 3.1: Cumulative Regret of NeoRL
Lemma A.1
Corollary A.2: Lower bound on the posterior log determinant
proof
Corollary A.3: Upper bound on the posterior log determinant
...and 29 more
