Table of Contents
Fetching ...

Learning-Based Shielding for Safe Autonomy under Unknown Dynamics

Robert Reed, Morteza Lahijanian

TL;DR

This paper develops an algorithm that computes the maximally permissive set of safe policies on the IMDP, ensuring avoidance of unsafe states and develops an approach that leverages Deep Kernel Learning to model the systems’ one-step evolution with uncertainty quantification.

Abstract

Shielding is a common method used to guarantee the safety of a system under a black-box controller, such as a neural network controller from deep reinforcement learning (DRL), with simpler, verified controllers. Existing shielding methods rely on formal verification through Markov Decision Processes (MDPs), assuming either known or finite-state models, which limits their applicability to DRL settings with unknown, continuous-state systems. This paper addresses these limitations by proposing a data-driven shielding methodology that guarantees safety for unknown systems under black-box controllers. The approach leverages Deep Kernel Learning to model the systems' one-step evolution with uncertainty quantification and constructs a finite-state abstraction as an Interval MDP (IMDP). By focusing on safety properties expressed in safe linear temporal logic (safe LTL), we develop an algorithm that computes the maximally permissive set of safe policies on the IMDP, ensuring avoidance of unsafe states. The algorithms soundness and computational complexity are demonstrated through theoretical proofs and experiments on nonlinear systems, including a high-dimensional autonomous spacecraft scenario.

Learning-Based Shielding for Safe Autonomy under Unknown Dynamics

TL;DR

This paper develops an algorithm that computes the maximally permissive set of safe policies on the IMDP, ensuring avoidance of unsafe states and develops an approach that leverages Deep Kernel Learning to model the systems’ one-step evolution with uncertainty quantification.

Abstract

Shielding is a common method used to guarantee the safety of a system under a black-box controller, such as a neural network controller from deep reinforcement learning (DRL), with simpler, verified controllers. Existing shielding methods rely on formal verification through Markov Decision Processes (MDPs), assuming either known or finite-state models, which limits their applicability to DRL settings with unknown, continuous-state systems. This paper addresses these limitations by proposing a data-driven shielding methodology that guarantees safety for unknown systems under black-box controllers. The approach leverages Deep Kernel Learning to model the systems' one-step evolution with uncertainty quantification and constructs a finite-state abstraction as an Interval MDP (IMDP). By focusing on safety properties expressed in safe linear temporal logic (safe LTL), we develop an algorithm that computes the maximally permissive set of safe policies on the IMDP, ensuring avoidance of unsafe states. The algorithms soundness and computational complexity are demonstrated through theoretical proofs and experiments on nonlinear systems, including a high-dimensional autonomous spacecraft scenario.

Paper Structure

This paper contains 13 sections, 5 theorems, 15 equations, 3 figures, 1 table, 1 algorithm.

Key Result

Proposition 1

Let $B_i \geq \|f^{(i)}\|$, $G = (K_{\mathcal{X},\mathcal{X}} + \sigma_n^2I)^{-1}$, $W_x = K_{x,\mathcal{X}}G$, $\gamma \leq f^{(i)}(\mathcal{X})^T G f^{(i)}(\mathcal{X})$, and $\lambda_x = 4\sigma_v^2K_{x,\mathcal{X}}G^2 K_{\mathcal{X},x}.$ Then if $\|\mathbf{v}\|_{\infty} \leq \sigma_v$, where $\epsilon^{(i)}(x,\delta) = \sigma_{D}^{(i)}(x)\sqrt{B_i^2 - \gamma} + \sqrt{\frac{\lambda_x}{2}\text{

Figures (3)

  • Figure 1: Post-Posed Shielded RL architecture. In our scenario, the shield can be taken as $\Pi_S$.
  • Figure 2: A visualization of the number of actions available at each state that can guarantee safety from the initial DFA state after using Algorithm \ref{['alg:action_removal_2']} on an IMDP generated from data. Dark green: all actions can be guaranteed safe to White: no action can be guaranteed safe for the given specification as shown in the legend on the right.
  • Figure 3: A histogram of Basilisk simulation end times for 1000 LEO trajectories with a shield generated by Algorithm \ref{['alg:action_removal_2']} (green) and without a shield (red).

Theorems & Definitions (14)

  • Definition 1: MDP
  • Definition 2: Safe LTL
  • Remark 1
  • Definition 3: IMDP
  • Proposition 1: reed2024error, Theorem 1
  • Proposition 2: skovbekk2023formaljackson2021formal, Theorems 1
  • Definition 4: DFA
  • Definition 5: Product IMDP
  • Theorem 1: Run Time Complexity
  • proof
  • ...and 4 more