Real-DRL: Teach and Learn in Reality

Yanbing Mao, Yihao Cai, Lui Sha

TL;DR

Real-DRL tackles safe runtime reinforcement learning for safety-critical autonomous systems by combining a data-driven DRL-Student with a physics-model-based PHY-Teacher and a Trigger that manages their interaction. The framework introduces safety-informed batch sampling and dual replay buffers so that corner cases are learned without compromising safety, and it assures safety under unknown unknowns and Sim2Real gaps through real-time patching and linear matrix inequalities (LMIs). The key contributions are the dual self-learning and teaching-to-learn paradigm, the safety-first learning hierarchy, and the adaptive safety mechanism. Experiments on a real robot and in simulated benchmarks show assured safety alongside competitive performance. The approach matters for deploying learning-enabled controllers in real plants, where safety cannot be sacrificed.

Abstract

This paper introduces the Real-DRL framework for safety-critical autonomous systems, enabling runtime learning of a deep reinforcement learning (DRL) agent to develop safe and high-performance action policies in real plants (i.e., real physical systems to be controlled), while prioritizing safety. Real-DRL consists of three interactive components: a DRL-Student, a PHY-Teacher, and a Trigger. The DRL-Student is a DRL agent whose novelty lies in the dual self-learning and teaching-to-learn paradigm and in real-time safety-informed batch sampling. The PHY-Teacher, by contrast, is a physics-model-based design of action policies that focuses solely on safety-critical functions. Its novelty is a real-time patch that serves two key missions: i) fostering the teaching-to-learn paradigm for the DRL-Student and ii) backing up the safety of real plants. The Trigger manages the interaction between the DRL-Student and the PHY-Teacher. Powered by these three components, Real-DRL can effectively address safety challenges that arise from unknown unknowns and the Sim2Real gap. Additionally, Real-DRL notably features i) assured safety, ii) automatic hierarchy learning (i.e., safety-first learning and then high-performance learning), and iii) safety-informed batch sampling to address the learning experience imbalance caused by corner cases. Experiments with a real quadruped robot, a quadruped robot in NVIDIA Isaac Gym, and a cart-pole system, along with comparisons and ablation studies, demonstrate Real-DRL's effectiveness and unique features.
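
The interaction among the three components can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the paper's algorithm: the class name `Trigger`, the predicate `is_safe`, the two policy callables, and the `teaching_buffer` list are all hypothetical stand-ins, and the actual switching condition used by Real-DRL is not given in this summary.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

State = Sequence[float]
Action = Sequence[float]


@dataclass
class Trigger:
    """Hypothetical switch between a learning policy and a safety backup."""

    is_safe: Callable[[State], bool]    # assumed safety predicate on the state
    student: Callable[[State], Action]  # stand-in for the DRL-Student policy
    teacher: Callable[[State], Action]  # stand-in for the PHY-Teacher policy
    # Assumed store for teacher experience, so the student can learn from it.
    teaching_buffer: List[Tuple[State, Action]] = field(default_factory=list)

    def act(self, state: State) -> Tuple[str, Action]:
        """Return which component acted and the action it produced."""
        if self.is_safe(state):
            return "student", self.student(state)
        # Unsafe: the teacher takes over and its action is recorded.
        action = self.teacher(state)
        self.teaching_buffer.append((state, action))
        return "teacher", action


# Toy 1-D usage: the state is "safe" while its magnitude stays below 1.
trigger = Trigger(
    is_safe=lambda s: abs(s[0]) < 1.0,
    student=lambda s: [0.5],
    teacher=lambda s: [-s[0]],
)
```

In this toy setup the student acts inside the safe region, while the teacher both overrides the action and leaves behind a record outside it, which mirrors the two missions the abstract assigns to the PHY-Teacher.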

Paper Structure

This paper contains 57 sections, 7 theorems, 104 equations, 12 figures, 3 tables, 3 algorithms.

Key Result

Theorem 4.1

Consider the safety set $\mathbb{S}$ and the function $V(\mathbf{s})$ with its matrix $\mathbf{P}$, as defined in the paper. Then the ellipsoid set $\left\{ \mathbf{s} \in \mathbb{R}^n \mid V(\mathbf{s}) < 1 \right\} \subseteq \mathbb{S}$, and it has the maximum volume.
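
To make the containment claim concrete, here is a small 2-D sketch. It rests on assumptions that this summary does not confirm: that $V(\mathbf{s}) = \mathbf{s}^\top \mathbf{P} \mathbf{s}$ with $\mathbf{P}$ symmetric positive definite, and that the safety set is a box $\{\mathbf{s} : |s_i| \le c_i\}$. The helper names are hypothetical. The check uses the standard fact that the supremum of $|s_i|$ over the ellipsoid is $\sqrt{(\mathbf{P}^{-1})_{ii}}$.

```python
import math


def inv2(P):
    """Invert a symmetric positive definite 2x2 matrix given as nested lists."""
    (a, b), (c, d) = P
    det = a * d - b * c
    if b != c or a <= 0 or det <= 0:
        raise ValueError("P must be symmetric positive definite")
    return [[d / det, -b / det], [-c / det, a / det]]


def ellipsoid_in_box(P, bounds):
    """True if {s : s^T P s < 1} lies inside {s : |s_i| <= bounds[i]}."""
    P_inv = inv2(P)
    # The largest |s_i| approached on the ellipsoid is sqrt((P^-1)_ii).
    return all(math.sqrt(P_inv[i][i]) <= bounds[i] for i in range(2))


def ellipsoid_area(P):
    """Area of {s : s^T P s < 1} in 2-D, equal to pi / sqrt(det P)."""
    (a, b), (c, d) = P
    return math.pi / math.sqrt(a * d - b * c)


bounds = [2.0, 1.0]                  # assumed box safety set
P_fit = [[0.25, 0.0], [0.0, 1.0]]    # semi-axes 2 and 1: touches the box
P_big = [[1 / 9, 0.0], [0.0, 1.0]]   # semi-axis 3 along s_1: leaves the box
```

Here `P_fit` passes the containment check and `P_big` fails it. Because the area grows as $\det \mathbf{P}$ shrinks, finding the maximum-volume ellipsoid amounts to making $\det \mathbf{P}$ as small as the containment constraints allow.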

Figures (12)

  • Figure 1: Real-DRL framework, which is also formally described in the paper's pseudocode.
  • Figure 2: An illustration example of safety-informed batch sampling, given batch size $L = 5$, $\rho_1 = 1$, and $\rho_2 = 0$.
  • Figure 3: Phase plots.
  • Figure 4: Trajectories of episode return.
  • Figure 5: Phase plots, given five random initial states within self-learning space.
  • ...and 7 more figures
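
The idea behind the batch sampling in Figure 2 can be sketched as drawing one training batch from two replay buffers. This is a rough illustration only. The symbols $\rho_1$ and $\rho_2$ from the caption are not defined in this summary, so the sketch does not use them; the function name `sample_batch` and the parameter `teach_fraction` are hypothetical, and the paper's actual mixing rule may differ.

```python
import random


def sample_batch(self_buffer, teach_buffer, batch_size, teach_fraction, rng=random):
    """Mix samples from a self-learning and a teaching-to-learn buffer.

    teach_fraction in [0, 1] is an assumed knob: the share of the batch
    drawn from the teaching buffer (e.g. larger when safety is at risk).
    """
    if not 0.0 <= teach_fraction <= 1.0:
        raise ValueError("teach_fraction must lie in [0, 1]")
    # Never request more samples than a buffer currently holds.
    n_teach = min(round(batch_size * teach_fraction), len(teach_buffer))
    n_self = min(batch_size - n_teach, len(self_buffer))
    return rng.sample(teach_buffer, n_teach) + rng.sample(self_buffer, n_self)


# Toy buffers with tagged entries, and the batch size L = 5 from Figure 2.
self_buffer = [("self", i) for i in range(10)]
teach_buffer = [("teach", i) for i in range(10)]
batch = sample_batch(self_buffer, teach_buffer, 5, 0.4, rng=random.Random(0))
```

With `teach_fraction = 0.4`, two of the five samples come from the teaching buffer. Raising the fraction near corner cases is one way to counter the experience imbalance the abstract mentions.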

Theorems & Definitions (15)

  • Remark 2.1: Buffer Zone
  • Definition 2.2: Safe Action Policy of DRL-Student
  • Definition 2.3: Safe Action Policy of PHY-Teacher
  • Theorem 4.1
  • Theorem 5.2: Real-Time Patch
  • Remark 5.3: Policy Computation
  • Remark 5.4: Adaptive Mechanism for Assuring Validity of the Assumption
  • Lemma A.1: Schur Complement (Zhang, 2006)
  • Lemma A.2
  • Lemma A.3
  • ...and 5 more