Physics-model-guided Worst-case Sampling for Safe Reinforcement Learning
Hongpeng Cao, Yanbing Mao, Lui Sha, Marco Caccamo
TL;DR
This work addresses safety in reinforcement learning for cyber-physical systems (CPS) by concentrating training on safety-critical corner cases. It introduces a physics-model-guided worst-case sampling strategy that defines worst-case states on the boundary of a Lyapunov-based safety envelope $\Omega$ (where $\mathbf{s}^T\mathbf{P}\mathbf{s}=1$) and integrates this with the Phy-DRL framework, yielding a residual action policy that combines a data-driven term with a physics-based corrective term. The authors provide a practical algorithm that generates boundary states using spectral and spherical parameterizations. Experiments on a cart-pole, a 2D quadrotor, and a quadruped show that worst-case sampling substantially improves safety and sampling efficiency compared to random sampling, with a scalable training curriculum. The authors also discuss architectural safeguards, such as Simplex-based fault tolerance and monitoring, that mitigate potential boundary-induced instability and enable safe deployment in real-world systems. Overall, the approach delivers more robust safe policies and more data-efficient training for safety-critical CPS.
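To make the boundary condition $\mathbf{s}^T\mathbf{P}\mathbf{s}=1$ concrete, the following is a minimal sketch of one way to draw states on that boundary. It assumes $\mathbf{P}$ is symmetric positive definite, eigendecomposes it as $\mathbf{P}=\mathbf{Q}\Lambda\mathbf{Q}^T$, and maps points $\mathbf{u}$ on the unit sphere to $\mathbf{s}=\mathbf{Q}\Lambda^{-1/2}\mathbf{u}$, so that $\mathbf{s}^T\mathbf{P}\mathbf{s}=\mathbf{u}^T\mathbf{u}=1$. The function name `sample_envelope_boundary` and the example matrix are hypothetical, and the paper's own algorithm may parameterize the sphere differently.

```python
import numpy as np


def sample_envelope_boundary(P, num_samples, rng=None):
    """Draw states s with s^T P s = 1 (hypothetical helper, not the paper's code)."""
    rng = np.random.default_rng() if rng is None else rng
    # Spectral step: P = Q diag(lam) Q^T with orthonormal Q.
    lam, Q = np.linalg.eigh(P)
    if np.any(lam <= 0):
        raise ValueError("P must be positive definite")
    # Spherical step: normalized Gaussian draws lie on the unit sphere.
    u = rng.standard_normal((num_samples, P.shape[0]))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    # Map the sphere onto the ellipsoid: s = Q diag(lam^{-1/2}) u (one state per row).
    return (u / np.sqrt(lam)) @ Q.T


# Assumed 2x2 example matrix, chosen only for illustration.
P = np.array([[2.0, 0.5], [0.5, 1.0]])
states = sample_envelope_boundary(P, 5, np.random.default_rng(0))
# Evaluate s^T P s for every sampled state; each value should equal 1.
values = np.einsum("ni,ij,nj->n", states, P, states)
```

Note that points drawn this way are uniform on the sphere before the mapping, not uniform over the surface of the ellipsoid; a training pipeline that needs a specific boundary distribution would have to account for that.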
Abstract
Real-world accidents in learning-enabled cyber-physical systems (CPS) frequently occur in challenging corner cases. When training a deep reinforcement learning (DRL) policy, the standard setup either fixes the training condition at a single initial condition or samples it uniformly from the admissible state space. This setup often overlooks the challenging but safety-critical corner cases. To bridge this gap, this paper proposes a physics-model-guided worst-case sampling strategy for training safe policies that can handle safety-critical cases toward guaranteed safety. Furthermore, we integrate the proposed worst-case sampling strategy into the physics-regulated deep reinforcement learning (Phy-DRL) framework to build a more data-efficient and safe learning algorithm for safety-critical CPS. We validate the proposed training strategy with Phy-DRL through extensive experiments on a simulated cart-pole system, a 2D quadrotor, and both a simulated and a real quadruped robot, showing markedly improved sampling efficiency in learning more robust safe policies.
