Leveraging Approximate Model-based Shielding for Probabilistic Safety Guarantees in Continuous Environments

Alexander W. Goodall, Francesco Belardinelli

TL;DR

The paper tackles safe reinforcement learning in continuous environments by extending Approximate Model-based Shielding (AMBS) to continuous state and action spaces and evaluating it on Safety Gym with DreamerV3 as the world model. It introduces three penalty-based gradient modification techniques—PENL, PLPG, and COPT—to inject safety considerations into policy optimization while avoiding the drawbacks of rejection-based shielding. The authors establish probabilistic safety guarantees for the continuous setting via sample complexity bounds under full and partial observability, and demonstrate dramatic reductions in safety violations across multiple Safety Gym tasks, albeit with slower convergence than some baselines. This work advances practical safe RL by enabling tunable safety guarantees and improved stability in continuous domains, which is critical for real-world deployment of model-based RL systems.

Abstract

Shielding is a popular technique for achieving safe reinforcement learning (RL). However, classical shielding approaches come with quite restrictive assumptions making them difficult to deploy in complex environments, particularly those with continuous state or action spaces. In this paper we extend the more versatile approximate model-based shielding (AMBS) framework to the continuous setting. In particular we use Safety Gym as our test-bed, allowing for a more direct comparison of AMBS with popular constrained RL algorithms. We also provide strong probabilistic safety guarantees for the continuous setting. In addition, we propose two novel penalty techniques that directly modify the policy gradient, which empirically provide more stable convergence in our experiments.

Paper Structure

This paper contains 38 sections, 7 theorems, 40 equations, 9 figures, 4 tables, and 2 algorithms.

Key Result

Theorem 1

Let $\epsilon > 0$, $\delta > 0$, $s \in S$ be given. With access to the true transition system $\mathcal{T}$, with probability $1 - \delta$ we can obtain an $\epsilon$-approximate estimate of the measure $\mu_{s\models\phi}$, by sampling $m$ traces $\tau \sim \mathcal{T}$, provided that,
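The condition on $m$ is not shown in this excerpt, so the sketch below does not reproduce it. As an illustration only, it assumes the standard two-sided Hoeffding bound, $m \geq \frac{1}{2\epsilon^2}\ln\frac{2}{\delta}$, which is a common way to obtain guarantees of this form and may differ from the paper's exact statement. The names `hoeffding_samples`, `estimate_safety`, and `toy_trace_is_safe` are hypothetical, and the toy trace generator merely stands in for sampling $\tau \sim \mathcal{T}$ and checking $\tau \models \phi$.

```python
import math
import random


def hoeffding_samples(epsilon: float, delta: float) -> int:
    """Smallest m with 2 * exp(-2 * m * epsilon**2) <= delta.

    Assumption: standard two-sided Hoeffding bound, not taken from the paper.
    """
    if epsilon <= 0 or not 0 < delta < 1:
        raise ValueError("need epsilon > 0 and 0 < delta < 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def estimate_safety(sample_trace_is_safe, epsilon: float, delta: float):
    """Monte Carlo estimate of the probability that a sampled trace is safe.

    `sample_trace_is_safe` is a zero-argument callable that samples one trace
    and returns True if the trace satisfies the safety property.
    """
    m = hoeffding_samples(epsilon, delta)
    safe_count = sum(1 for _ in range(m) if sample_trace_is_safe())
    return safe_count / m, m


def toy_trace_is_safe(rng: random.Random, horizon: int = 10,
                      p_unsafe_step: float = 0.02) -> bool:
    """Hypothetical stand-in for a rollout of the transition system:
    each step is independently unsafe with probability p_unsafe_step."""
    return all(rng.random() >= p_unsafe_step for _ in range(horizon))


rng = random.Random(0)
estimate, m = estimate_safety(lambda: toy_trace_is_safe(rng),
                              epsilon=0.05, delta=0.05)
print(f"m = {m}, estimated safety probability = {estimate:.3f}")
```

Under this assumed bound, the number of traces depends only on $\epsilon$ and $\delta$, not on the size of the state space, which is what makes sampling-based estimates attractive in continuous environments.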

Figures (9)

  • Figure 1: A simple example in Safety Gym (Ray et al., 2019). The task policy proposes actions along the optimal trajectory. However, this trajectory enters an unsafe region and so the shield overrides these actions with "Break!" actions proposed by the safe policy. As a result, the safe trajectory is not followed and the two policies continuously fight for control.
  • Figure 2: POMDP with Labels.
  • Figure 3: Safety Gym environments.
  • Figure 4: Episode return (left) and cumulative violations (right) for PointGoal1, PointGoal2 and CarGoal1.
  • Figure 5: Long run (10M frames) episode return (left) and cumulative violations (right) for PointGoal1.
  • ...and 4 more figures

Theorems & Definitions (7)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 1 (restated)
  • Lemma 1: error amplification
  • Theorem 2 (restated)
  • Theorem 3 (restated)