NeoRL: Efficient Exploration for Nonepisodic RL

Bhavya Sukhija, Lenart Treven, Florian Dörfler, Stelian Coros, Andreas Krause

TL;DR

NeoRL addresses nonepisodic reinforcement learning for unknown nonlinear dynamics learned from a single trajectory. It introduces a model-based optimistic approach that plans with epistemic uncertainty and uses RKHS/GP dynamics with a horizon scheduling rule to achieve sublinear regret, proving $R_T \le C(\mathbf{x}_0, K, \gamma) \Gamma_T \sqrt{T}$ with high probability. The method leverages calibrated uncertainty models, MPC-based planning, and a theoretical framework ensuring ergodicity and stability, while empirical results show convergence to the optimal average cost across diverse, high-dimensional environments with limited interactions. This work advances the practical and theoretical foundations of nonepisodic deep RL by enabling stable exploration without resets and providing concrete regret guarantees for nonlinear systems.
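To make "sublinear regret" concrete, the stated bound can be divided by $T$ to obtain the time-averaged regret. The step below uses only the bound quoted above; the condition $\Gamma_T = o(\sqrt{T})$ is an added assumption about the kernel-dependent term (it holds, for example, when $\Gamma_T$ grows polylogarithmically in $T$).

```latex
\frac{R_T}{T}
  \;\le\; \frac{C(\mathbf{x}_0, K, \gamma)\,\Gamma_T \sqrt{T}}{T}
  \;=\; \frac{C(\mathbf{x}_0, K, \gamma)\,\Gamma_T}{\sqrt{T}}
  \;\xrightarrow{\;T \to \infty\;}\; 0
  \qquad \text{whenever } \Gamma_T = o\!\left(\sqrt{T}\right).
```

In words: under that assumption the average per-step gap to the optimal average cost vanishes, which is what convergence to the optimal average cost requires.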

Abstract

We study the problem of nonepisodic reinforcement learning (RL) for nonlinear dynamical systems, where the system dynamics are unknown and the RL agent has to learn from a single trajectory, i.e., without resets. We propose Nonepisodic Optimistic RL (NeoRL), an approach based on the principle of optimism in the face of uncertainty. NeoRL uses well-calibrated probabilistic models and plans optimistically w.r.t. the epistemic uncertainty about the unknown dynamics. Under continuity and bounded energy assumptions on the system, we provide a first-of-its-kind regret bound of $O(\Gamma_T \sqrt{T})$ for general nonlinear systems with Gaussian process dynamics. We compare NeoRL to other baselines on several deep RL environments and empirically demonstrate that NeoRL achieves the optimal average cost while incurring the least regret.
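As a rough illustration of planning optimistically with respect to epistemic uncertainty, the sketch below compares a mean-only planner with one that may also pick any dynamics inside the model's confidence band. It is not the authors' implementation: the 1-D `toy_model`, the quadratic `stage_cost`, the scaling `beta`, the random-shooting planner, and the auxiliary variables `etas` in $[-1, 1]$ (one common way to parameterize plausible dynamics) are all assumptions made for this example.

```python
import numpy as np

def toy_model(x, u):
    """Hypothetical calibrated model: mean and epistemic std of the next state."""
    mean = 0.9 * x + 0.5 * u
    std = 0.1 * np.ones_like(mean)
    return mean, std

def stage_cost(x, u):
    """Hypothetical quadratic cost on state and action."""
    return x**2 + 0.1 * u**2

def rollout_cost(x0, actions, etas, beta):
    """Cost of each plan under the dynamics mean + beta * std * eta."""
    n_plans, horizon = actions.shape
    x = np.full(n_plans, float(x0))
    total = np.zeros(n_plans)
    for t in range(horizon):
        u = actions[:, t]
        total += stage_cost(x, u)
        mean, std = toy_model(x, u)
        # eta in [-1, 1] selects a next state inside the confidence band.
        x = mean + beta * std * etas[:, t]
    return total

def plan(x0, optimistic, horizon=10, n_samples=2000, beta=2.0, seed=0):
    """Random-shooting planner; returns (first action, planned cost)."""
    rng = np.random.default_rng(seed)
    actions = rng.uniform(-1.0, 1.0, size=(n_samples, horizon))
    zeros = np.zeros_like(actions)
    if optimistic:
        # Search jointly over actions and plausible dynamics. The mean-only
        # candidates (eta = 0) are kept, so optimism can only lower the cost.
        etas = rng.uniform(-1.0, 1.0, size=(n_samples, horizon))
        actions = np.concatenate([actions, actions])
        etas = np.concatenate([zeros, etas])
    else:
        etas = zeros
    costs = rollout_cost(x0, actions, etas, beta)
    best = int(np.argmin(costs))
    return float(actions[best, 0]), float(costs[best])

if __name__ == "__main__":
    print("mean-only :", plan(1.0, optimistic=False))
    print("optimistic:", plan(1.0, optimistic=True))
```

Because the optimistic search space contains every mean-only candidate, its planned cost is never higher. The gap between the two planned costs shrinks as the model's uncertainty shrinks, which is the mechanism that drives exploration toward poorly understood regions.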

Paper Structure

This paper contains 27 sections, 19 theorems, 107 equations, 2 figures, 3 tables, 2 algorithms.

Key Result

Lemma 2.5

Assume ${\bm{f}}^*$ is uniformly continuous and, for all $\bm{\pi} \in \Pi$ and ${\bm{x}} \in {\mathcal{X}}$, $\left\| \bm{\pi}({\bm{x}}) \right\| \leq u_{\max}$. Further assume there exists $\bm{\pi}_s \in \Pi$ such that we have constants $K, C_u, C_l$ with $C_u > C_l$, $\gamma \in (0, 1)$, $\kappa, \ldots$ where ${\bm{x}}_+ = {\bm{f}}^*({\bm{x}}, \bm{\pi}({\bm{x}})) + {\bm{w}}$. Then, $V$ also satisfies

Figures (2)

  • Figure 1: Average reward $A(\bm{\pi})$ and cumulative regret $R_T$ over ten different seeds for all environments. We report the mean performance with one standard error as shaded regions. During all experiments, the environment is never reset. For all baselines, we model the dynamics with probabilistic ensembles, except in the Pendulum-GP experiment, where GPs are used instead. NeoRL significantly outperforms all baselines and converges to the optimal average reward, $A(\bm{\pi}^*) = 0$, showing sublinear cumulative regret $R_T$ for all environments.
  • Figure 2: Total number of resets and cumulative regret $R_T$ for the cart pole balancing task over ten different seeds. We report the mean performance with one standard error as the shaded region. The environment is automatically reset whenever the agent drops the pole. All baselines solve the task, but NeoRL converges the fastest, requiring fewer resets and suffering smaller regret.
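
The quantities plotted in these figures can be computed along the following lines. This is a minimal sketch, not the authors' evaluation code: it assumes cumulative regret is the running sum of the gap between the optimal average reward (0 in Figure 1) and the received reward, and it uses synthetic rewards in place of real experiment logs. The function names are hypothetical.

```python
import numpy as np

def cumulative_regret(rewards, optimal_avg_reward=0.0):
    """Running sum of (optimal average reward - received reward) along time."""
    rewards = np.asarray(rewards, dtype=float)
    return np.cumsum(optimal_avg_reward - rewards, axis=-1)

def mean_and_standard_error(curves):
    """Mean and standard error across seeds (axis 0) at every time step."""
    curves = np.asarray(curves, dtype=float)
    mean = curves.mean(axis=0)
    stderr = curves.std(axis=0, ddof=1) / np.sqrt(curves.shape[0])
    return mean, stderr

# Synthetic stand-in for logged rewards: 10 seeds, 500 steps,
# with rewards approaching the optimal average reward 0 from below.
rng = np.random.default_rng(0)
steps = np.arange(1, 501)
rewards = -(1.0 + 0.1 * rng.random((10, 500))) / np.sqrt(steps)

regret = cumulative_regret(rewards)            # shape (10, 500)
mean, stderr = mean_and_standard_error(regret)
lower, upper = mean - stderr, mean + stderr    # one-standard-error band
```

The band `[lower, upper]` corresponds to a shaded region of one standard error around the mean curve. Sublinear regret shows up as `mean[T - 1] / T` decreasing with `T`.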

Theorems & Definitions (39)

  • Definition 2.3: ${\mathcal{K}}_{\infty}$-functions
  • Lemma 2.5
  • Theorem 2.6: Existence of Average Cost Solution
  • Definition 2.7: Well-calibrated statistical model of ${\bm{f}}^*$ (Rothfuss et al., 2023)
  • Lemma 2.9: Well-calibrated confidence intervals for RKHS (Rothfuss et al., 2023)
  • Theorem 3.1: Cumulative Regret of NeoRL
  • Lemma A.1
  • Corollary A.2: Lower bound on the posterior log determinant
  • proof
  • Corollary A.3: Upper bound on the posterior log determinant
  • ...and 29 more