Physics-model-guided Worst-case Sampling for Safe Reinforcement Learning

Hongpeng Cao, Yanbing Mao, Lui Sha, Marco Caccamo

TL;DR

This work tackles the challenge of ensuring safety for reinforcement learning in cyber-physical systems by focusing training on safety-critical corner cases. It introduces a physics-model-guided worst-case sampling strategy that defines worst-case states on the boundary of a Lyapunov-based safety envelope $\Omega$ (where $\mathbf{s}^\top\mathbf{P}\mathbf{s}=1$) and integrates this with the Phy-DRL framework, yielding a residual action policy that combines a data-driven term with a physics-based corrective term. The authors provide a practical algorithm to generate boundary states using spectral and spherical parameterizations, and demonstrate through cart-pole, 2D quadrotor, and quadruped experiments that worst-case sampling substantially improves safety guarantees and sampling efficiency compared to random sampling, with a scalable training curriculum. They also discuss architectural safeguards, such as Simplex-based fault tolerance and monitoring, to mitigate potential boundary-induced instability and to enable safe deployment in real-world systems. Overall, the approach delivers more robust safe policies and more data-efficient training for safety-critical CPS.

Abstract

Real-world accidents in learning-enabled cyber-physical systems (CPS) frequently occur in challenging corner cases. When training a deep reinforcement learning (DRL) policy, the training conditions are typically either fixed at a single initial condition or uniformly sampled from the admissible state space. This setup often overlooks the challenging but safety-critical corner cases. To bridge this gap, this paper proposes a physics-model-guided worst-case sampling strategy for training safe policies that can handle safety-critical cases toward guaranteed safety. Furthermore, we integrate the proposed worst-case sampling strategy into the physics-regulated deep reinforcement learning (Phy-DRL) framework to build a more data-efficient and safe learning algorithm for safety-critical CPS. We validate the proposed training strategy with Phy-DRL through extensive experiments on a simulated cart-pole system, a 2D quadrotor, and both a simulated and a real quadruped robot, showing remarkably improved sampling efficiency in learning more robust safe policies.

Paper Structure

This paper contains 19 sections, 2 theorems, 25 equations, 10 figures, 2 tables, 1 algorithm.

Key Result

Lemma 3.2

Given $\mathbf{P} \succ 0$, the solution $\mathbf{s} \in \mathbb{R}^{n}$ of $\mathbf{s}^\top \mathbf{P} \mathbf{s} = \varphi$ is given in closed form [equation not shown], where $\mathbf{Q}(\mathbf{P})$ is the orthogonal matrix associated with $\mathbf{P}$, and $\lambda_i(\mathbf{P})$ is the $i$-th eigenvalue of the matrix $\mathbf{P} \in \mathbb{R}^{n \times n}$.
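
To make the idea behind the lemma concrete, here is a minimal sketch of one way to generate points satisfying $\mathbf{s}^\top \mathbf{P} \mathbf{s} = \varphi$ from an eigendecomposition of $\mathbf{P}$ and a set of hyperspherical angles. It is not the paper's exact expression, which may use a different ordering or transpose convention. The helper names `unit_vector_from_angles` and `boundary_state` are hypothetical, and the sketch assumes $\mathbf{Q}$ holds the eigenvectors of $\mathbf{P}$ as columns, as returned by `numpy.linalg.eigh`.

```python
import numpy as np


def unit_vector_from_angles(angles):
    """Map n-1 hyperspherical angles to a unit vector in R^n."""
    angles = np.asarray(angles, dtype=float)
    u = np.ones(angles.size + 1)
    for i, theta in enumerate(angles):
        u[i] *= np.cos(theta)        # close off coordinate i with a cosine
        u[i + 1:] *= np.sin(theta)   # remaining coordinates pick up a sine
    return u


def boundary_state(P, angles, phi=1.0):
    """Return s with s^T P s = phi (illustrative helper, not the paper's code)."""
    P = np.asarray(P, dtype=float)
    # Assumption: P is symmetric, so P = Q diag(eigvals) Q^T with Q orthogonal.
    eigvals, Q = np.linalg.eigh(P)
    if np.any(eigvals <= 0):
        raise ValueError("P must be positive definite")
    u = unit_vector_from_angles(angles)
    # Scale each principal axis by sqrt(phi / lambda_i), then rotate by Q.
    return Q @ (np.sqrt(phi / eigvals) * u)


if __name__ == "__main__":
    # Example values chosen for illustration only.
    P = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.0, 0.0, 3.0]])
    rng = np.random.default_rng(0)
    angles = rng.uniform(0.0, 2.0 * np.pi, size=P.shape[0] - 1)
    s = boundary_state(P, angles, phi=1.0)
    print("s =", s)
    print("s^T P s =", s @ P @ s)
```

The construction works because, writing $\mathbf{s} = \mathbf{Q}\,\mathrm{diag}\big(\sqrt{\varphi/\lambda_i}\big)\,\mathbf{u}$ with $\|\mathbf{u}\| = 1$, the quadratic form reduces to $\varphi \sum_i u_i^2 = \varphi$. Sweeping the angles therefore traces out the boundary of the envelope.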

Figures (10)

  • Figure 1: Phy-DRL training powered by periodic and sparse worst-case sampling for safety-critical CPS.
  • Figure 2: Worst-case condition generation for a three-dimensional $(n=3)$ safety envelope.
  • Figure 3: Worst-case sampling vs. random sampling, with the termination condition. Blue: area of IE samples. Green: area of EE samples. Ellipse area: safety envelope. Panels (a) and (b) show the test results in the $x$ and $\theta$ dimensions, while (c) and (d) show the results in the $v$ and $w$ dimensions. The size of the colored area indicates the safety and robustness of the learned policy; the larger, the better.
  • Figure 4: Worst-case sampling vs. random sampling, without the termination condition in training.
  • Figure 5: (a)-(c): The number and locations of IE samples visualized in $x$-$y$-$\theta$ space. $\text{Phy-DRL}_{\text{wc}}$ has many more colored points, meaning that it can almost render the safety envelope invariant. (d): Reward curves (five random seeds): $\text{Phy-DRL}_{\text{wc}}$ vs. $\text{Phy-DRL}_{\text{ran}}$.
  • ...and 5 more figures

Theorems & Definitions (4)

  • Definition 2.1
  • Definition 3.1: Worst-case conditions
  • Lemma 3.2
  • Lemma A.1: Positive Definiteness (Bhatia, 2009)