Dynamic Model Predictive Shielding for Provably Safe Reinforcement Learning

Arko Banerjee; Kia Rahmani; Joydeep Biswas; Isil Dillig

TL;DR

DMPS integrates a dynamic, planner-driven recovery mechanism with a neural policy to achieve provably safe reinforcement learning in high-dimensional control tasks. By using a local planner to identify recovery actions that maximize short-term progress while accounting for long-term value, DMPS reduces shield usage and improves final performance, with recovery regret (RR) decaying exponentially in the planning horizon $n$. The approach guarantees safety during both training and deployment, and empirical results across static and dynamic benchmarks show DMPS outperforming MPS and several SRL/PSRL baselines, particularly in dynamic environments. The key theoretical guarantee is $RR = O(\gamma^n)$ under appropriate planner properties, and the training loop lets the policy imitate safe planner actions while learning task-driven behavior. The work advances safe RL by fusing planning and learning, achieving practical safety with improved learning efficiency and performance in continuous domains.
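To give a feel for what an $O(\gamma^n)$ bound implies, the sketch below computes the smallest horizon $n$ for which $\gamma^n$ drops below a target $\varepsilon$. It is only an illustration: the hidden constant in the bound is assumed to be 1, and the function name `min_horizon` and the sample values of $\gamma$ and $\varepsilon$ are hypothetical, not taken from the paper.

```python
import math

def min_horizon(gamma: float, eps: float) -> int:
    """Smallest integer n with gamma**n <= eps, for 0 < gamma < 1 and 0 < eps < 1.

    Assumption: the constant hidden by the O(.) notation is taken to be 1.
    """
    if not (0.0 < gamma < 1.0 and 0.0 < eps < 1.0):
        raise ValueError("gamma and eps must lie strictly between 0 and 1")
    # gamma**n <= eps  <=>  n >= log(eps) / log(gamma), since log(gamma) < 0.
    return math.ceil(math.log(eps) / math.log(gamma))

if __name__ == "__main__":
    for gamma in (0.9, 0.95, 0.99):
        print(gamma, min_horizon(gamma, 0.01))
```

Because the decay is exponential, halving the target $\varepsilon$ adds only a constant number of planning steps, although discount factors close to 1 make that constant large.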

Abstract

Among approaches for provably safe reinforcement learning, Model Predictive Shielding (MPS) has proven effective at complex tasks in continuous, high-dimensional state spaces by leveraging a backup policy to ensure safety when the learned policy attempts to take risky actions. However, while MPS can ensure safety both during and after training, it often hinders task progress because backup policies are conservative and task-oblivious. This paper introduces Dynamic Model Predictive Shielding (DMPS), which optimizes reinforcement learning objectives while maintaining provable safety. DMPS employs a local planner to dynamically select safe recovery actions that maximize both short-term progress and long-term rewards. Crucially, the planner and the neural policy play synergistic roles in DMPS. When planning recovery actions to ensure safety, the planner uses the neural policy to estimate long-term rewards, allowing it to look beyond its short-term planning horizon. Conversely, the neural policy under training learns from the recovery plans proposed by the planner, converging to policies that are both high-performing and safe in practice. This approach guarantees safety during and after training, with bounded recovery regret that decreases exponentially with planning horizon depth. Experimental results demonstrate that DMPS converges to policies that rarely require shield interventions after training and achieve higher rewards than several state-of-the-art baselines.
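The contrast between a task-oblivious backup policy and a planned recovery can be sketched on a toy grid. Everything in the block below is a hypothetical illustration rather than the paper's implementation: the grid, the "stay in place" backup policy, the hand-written `greedy_policy` standing in for the neural policy, the `value_estimate` heuristic standing in for the learned long-term value, and exhaustive enumeration in place of a sampling-based planner are all assumptions.

```python
from itertools import product

# Hypothetical toy grid: the agent starts at (0, 0) and wants to reach GOAL.
WIDTH, HEIGHT = 7, 3
OBSTACLES = {(3, 0)}
GOAL = (6, 0)
STAY = (0, 0)
ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1), STAY]
GAMMA = 0.9

def step(s, a):
    return (s[0] + a[0], s[1] + a[1])

def is_safe(s):
    return 0 <= s[0] < WIDTH and 0 <= s[1] < HEIGHT and s not in OBSTACLES

def is_recoverable(s):
    # Assumption: the backup policy is "stay", so in this static grid a state
    # is recoverable exactly when it is safe. In general one would roll out
    # the backup policy from s and check every visited state.
    return is_safe(s)

def reward(s):
    return -(abs(GOAL[0] - s[0]) + abs(GOAL[1] - s[1]))

def value_estimate(s):
    # Stand-in for the long-term value the learned policy would provide.
    return reward(s) / (1 - GAMMA)

def greedy_policy(s):
    # Stand-in for the neural policy: heads for the goal, ignoring obstacles.
    if s[0] < GOAL[0]:
        return (1, 0)
    if s[1] > GOAL[1]:
        return (0, -1)
    return STAY

def plan_recovery(s, horizon):
    """Return the first action of the best plan that keeps every state recoverable."""
    best_first, best_score = STAY, float("-inf")
    for plan in product(ACTIONS, repeat=horizon):
        cur, score = s, 0.0
        for t, a in enumerate(plan):
            cur = step(cur, a)
            if not is_recoverable(cur):
                break  # discard plans that leave the recoverable set
            score += GAMMA ** t * reward(cur)
        else:
            # Short-term rewards plus the discounted long-term estimate.
            score += GAMMA ** horizon * value_estimate(cur)
            if score > best_score:
                best_first, best_score = plan[0], score
    return best_first

def run(shield, steps=12, horizon=3):
    """Simulate the shielded policy; shield is "mps" or "dmps"."""
    s, visited, interventions = (0, 0), [(0, 0)], 0
    for _ in range(steps):
        a = greedy_policy(s)
        if not is_recoverable(step(s, a)):
            interventions += 1
            a = STAY if shield == "mps" else plan_recovery(s, horizon)
        s = step(s, a)
        visited.append(s)
    return visited, interventions

if __name__ == "__main__":
    for shield in ("mps", "dmps"):
        visited, interventions = run(shield)
        print(shield, "final state:", visited[-1], "interventions:", interventions)
```

In this toy setting both shields keep the agent safe, but the fixed backup policy leaves it parked in front of the obstacle, while the planned recovery steps around it and hands control back to the policy. The sketch omits the training side described above, in which the policy learns from the planner's recovery actions.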

Paper Structure (25 sections, 6 theorems, 31 equations, 8 figures, 2 tables, 3 algorithms)

Introduction
Related Work
Preliminaries
Model Predictive Shielding
Recovery Regret
Dynamic Model Predictive Shielding
Planning Optimal Recovery
Training Algorithm
Experiments
Safety Results
Performance Results
Analysis
Limitations
Determinism
Computational Overhead
...and 10 more sections

Key Result

Theorem 5.1

(Simplified; the extended theorem statement and proof are provided in the appendix.) Suppose a probabilistically complete and asymptotically optimal planner is used with planning horizon $n$ and sampling limit $m$. Under mild assumptions on the MDP, the recovery regret of policy $\pi^*_{\textnor

Figures (8)

Figure 1: Overview of an execution cycle in MPS (➊, ➋) and DMPS (➊, ➋, ➌, ➍).
Figure 2: (a) Unsafe trajectory leading to a collision. (b) Safe but sub-optimal trajectory. (c) Optimal and safe trajectory. (d) An instance of the planning phase.
Figure 3: Shield invocations in double-gate and double-gate+.
Figure 4: Episodic returns in double-gate and double-gate+.
Figure 5: Example trajectories in double-gate+.
...and 3 more figures

Theorems & Definitions (10)

Theorem 5.1
Theorem A.1
Lemma A.2
proof
Lemma A.3
proof
Lemma A.4
proof
Lemma A.5
proof
