Probabilistic Shielding for Safe Reinforcement Learning

Edwin Hamel-De le Court, Francesco Belardinelli, Alexander W. Goodall

TL;DR

This paper tackles Safe Reinforcement Learning when safety is defined as avoiding unsafe states with an undiscounted probabilistic constraint, assuming known safety dynamics. It introduces probabilistic shielding, which augments the MDP with a safety level and constructs a shield Sh^{≤ p}_{β}(M) that enforces safety via a bound β computed from the minimal probability of reaching unsafe states, using value-iteration-based safety costs rather than linear programming. The shield can be implemented as a gym environment En_{β}^{≤ p}(M) and used with standard RL algorithms (e.g., PPO), yielding safety guarantees during training and testing and, under mild conditions, near-optimal reward performance relative to the RCOP objective. Empirical evaluation across five environments (media streaming, two Colour bomb gridworlds, two Bridge crossing variants, and Pacman) shows that PPO-shield guarantees safety and often matches or surpasses the performance of baselines such as CPO and PPO-Lagrangian, illustrating scalability and practical impact for safe RL applications. Overall, the approach provides a scalable, formally grounded alternative to LP-based Safe RL, enabling strict constraint satisfaction with competitive reward optimization in real-world-like settings.

Abstract

In real-life scenarios, a Reinforcement Learning (RL) agent aiming to maximise its reward must often also behave in a safe manner, including at training time. Thus, much attention in recent years has been given to Safe RL, where an agent aims to learn an optimal policy among all policies that satisfy a given safety constraint. However, strict safety guarantees are often provided through approaches based on linear programming, and thus scale poorly. In this paper we present a new, scalable method that enjoys strict formal guarantees for Safe RL, in the case where the safety dynamics of the Markov Decision Process (MDP) are known and safety is defined as an undiscounted probabilistic avoidance property. Our approach is based on state-augmentation of the MDP, and on the design of a shield that restricts the actions available to the agent. We show that our approach provides a strict formal safety guarantee that the agent stays safe at training and test time. Furthermore, we demonstrate that our approach is viable in practice through experimental evaluation.
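The TL;DR mentions safety costs obtained by value iteration over the minimal probability of reaching unsafe states. The following is a minimal sketch of that kind of computation on a small tabular MDP, not the paper's algorithm: the function name `min_reach_probability`, the dictionary-based transition format, and the toy MDP are all assumptions made for illustration. Iterating from zero approximates the minimal reachability probability from below, so a real shield would need a sound upper bound rather than this plain approximation.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical format: transitions[state][action] = [(next_state, probability), ...]
Transitions = Dict[int, Dict[int, List[Tuple[int, float]]]]


def min_reach_probability(
    transitions: Transitions,
    unsafe: Set[int],
    tol: float = 1e-10,
    max_iters: int = 10_000,
) -> Dict[int, float]:
    """Approximate, for each state, the minimal probability (over policies)
    of ever reaching an unsafe state, using in-place value iteration."""
    # Unsafe states have value 1; every other state starts at 0.
    beta = {s: (1.0 if s in unsafe else 0.0) for s in transitions}
    for _ in range(max_iters):
        delta = 0.0
        for s, actions in transitions.items():
            if s in unsafe or not actions:
                continue  # unsafe or action-less states keep their value
            # Bellman update: pick the action with the lowest expected risk.
            new = min(
                sum(p * beta[t] for t, p in successors)
                for successors in actions.values()
            )
            delta = max(delta, abs(new - beta[s]))
            beta[s] = new
        if delta < tol:
            break
    return beta


# Toy MDP (assumed for illustration): state 2 is unsafe, state 3 is a safe sink.
toy: Transitions = {
    0: {0: [(2, 0.5), (3, 0.5)],   # risky action
        1: [(1, 1.0)]},            # cautious action
    1: {0: [(2, 0.1), (3, 0.9)]},
    2: {},
    3: {0: [(3, 1.0)]},
}
beta = min_reach_probability(toy, unsafe={2})
```

In this toy example the cautious action in state 0 leads to state 1, whose only action carries a 0.1 risk, so the minimal risk from state 0 is also about 0.1. An action filter built on such values would allow an action only if its expected risk fits within the agent's remaining safety budget.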

Paper Structure

This paper contains 32 sections, 5 theorems, 19 equations, 3 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

For any memoryless policy $\pi$ in $\text{Sh}^{\leq p}_{\beta}\left(\mathcal{M}\right)$, we have

Figures (3)

  • Figure 1: Gridworld Environments
  • Figure 2: Learning curves
  • Figure 3: Learning curves for additional experiments

Theorems & Definitions (10)

  • Definition 1: Reachability-Constrained Optimization Problem (RCOP)
  • Definition 2
  • Definition 3: The Shield
  • Theorem 1: Safety guarantee in any shield
  • Corollary 1: Safety guarantee in the original MDP
  • Theorem 2: Optimality-preserving guarantees
  • Theorem 1: Safety guarantee in any shield
  • proof
  • Theorem 2: Optimality-preserving guarantees
  • proof
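
The outline names Definition 1 (RCOP) without reproducing it. Based only on the abstract's description (an optimal policy among those satisfying an undiscounted probabilistic avoidance constraint), a problem of this kind can be sketched as follows. The symbols $J(\pi)$ for the reward objective, $S_{\text{unsafe}}$ for the set of unsafe states, and $s_t$ for the state at time $t$ are assumed notation, and the paper's exact definition may differ.

$$
\max_{\pi} \; J(\pi)
\quad \text{subject to} \quad
\mathbb{P}_{\pi}\left(\exists\, t \geq 0 : s_t \in S_{\text{unsafe}}\right) \leq p
$$

Here $p$ is the safety threshold that also appears in the shield notation $\text{Sh}^{\leq p}_{\beta}\left(\mathcal{M}\right)$.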