Predictive Safety Shield for Dyna-Q Reinforcement Learning

Jin Pin, Krasowski Hanna, Vanneaux Elena

TL;DR

This work tackles the challenge of achieving hard safety guarantees in reinforcement learning by introducing a predictive safety shield for discrete-space, model-based RL, specifically integrated with Dyna-Q. The shield uses a safety-relevant environment model to perform multi-step planning and updates a local $Q$-function $Q_W$ to bias action selection toward safe, high-return trajectories, addressing sim-to-real gaps without retraining. The authors prove optimality under full observability with static obstacles and demonstrate, in gridworld experiments, that short horizons ($N$ small) can yield near-optimal solutions and robust performance under distribution shifts. This approach enables safer, more reliable RL in discrete domains and offers a foundation for extending to continuous spaces and more dynamic environments in future work.

Abstract

Obtaining safety guarantees for reinforcement learning is a major obstacle to applying it to real-world tasks. Safety shields extend standard reinforcement learning and achieve hard safety guarantees. However, existing safety shields commonly use random sampling of safe actions or a fixed fallback controller, thereby disregarding the future performance implications of different safe actions. In this work, we propose a predictive safety shield for model-based reinforcement learning agents in discrete space. Our safety shield updates the Q-function locally based on safe predictions, which originate from a safe simulation of the environment model. This shielding approach improves performance while maintaining hard safety guarantees. Our experiments on gridworld environments demonstrate that even short prediction horizons can be sufficient to identify the optimal path. We observe that our approach is robust to distribution shifts, e.g., between simulation and reality, without requiring additional training.
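
The idea of updating Q-values locally from safe multi-step predictions can be illustrated with a small sketch. This is not the authors' algorithm, only a minimal interpretation under stated assumptions: a deterministic gridworld, a static set of unsafe cells known at deployment, a hypothetical reward of +1 at the goal and -0.1 per step, and a learned Q-table used only to bootstrap at the end of the horizon. The class name `ShieldedGrid` and all of its methods are invented for illustration.

```python
from typing import Dict, List, Set, Tuple

State = Tuple[int, int]
MOVES: List[Tuple[int, int]] = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


class ShieldedGrid:
    """Hypothetical deterministic gridworld with a known set of unsafe cells."""

    def __init__(self, size: int, goal: State, unsafe: Set[State], gamma: float = 0.95):
        self.size, self.goal, self.unsafe, self.gamma = size, goal, unsafe, gamma

    def step(self, s: State, a: int) -> State:
        # The agent stays in place if the move would leave the grid.
        r, c = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
        return (r, c) if 0 <= r < self.size and 0 <= c < self.size else s

    def reward(self, s: State) -> float:
        # Assumed reward: +1 on reaching the goal, small step cost otherwise.
        return 1.0 if s == self.goal else -0.1

    def safe_actions(self, s: State) -> List[int]:
        # An action is safe if the predicted next cell is not unsafe.
        return [a for a in range(len(MOVES)) if self.step(s, a) not in self.unsafe]

    def value(self, s: State, depth: int, q: Dict[Tuple[State, int], float]) -> float:
        """Best return over safe action sequences of length `depth` from `s`."""
        if s == self.goal:
            return 0.0  # terminal state: nothing more to collect
        acts = self.safe_actions(s)
        if not acts:
            return float("-inf")  # dead end: no safe continuation exists
        if depth == 0:
            # End of the horizon: bootstrap from the learned Q-table, safe actions only.
            return max(q.get((s, a), 0.0) for a in acts)
        return max(self.local_q(s, depth, q).values())

    def local_q(self, s: State, horizon: int, q: Dict[Tuple[State, int], float]) -> Dict[int, float]:
        """Locally recomputed Q-values for the safe actions in `s` (horizon >= 1)."""
        out = {}
        for a in self.safe_actions(s):
            nxt = self.step(s, a)
            out[a] = self.reward(nxt) + self.gamma * self.value(nxt, horizon - 1, q)
        return out


# The learned Q-table prefers moving right from (0, 0), but at deployment
# the cell (0, 1) is blocked, so the shield must find a safe detour.
grid = ShieldedGrid(size=3, goal=(0, 2), unsafe={(0, 1)})
q = {((0, 0), 3): 1.0}
local = grid.local_q((0, 0), horizon=4, q=q)
best = max(local, key=local.get)  # action 1 (down), the first move of the detour
```

In this toy setup the unsafe action never enters the local Q-values, and a horizon of four steps is enough to see the goal through the detour. The exhaustive lookahead grows exponentially with the horizon, so the sketch is only practical for short horizons, and it does not model time-varying safety specifications.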

Paper Structure

This paper contains 16 sections, 1 theorem, 4 equations, 4 figures, 2 tables, 1 algorithm.

Key Result

Theorem V.1

Let the environment be modeled as an MDP $\mathcal{P} = (S, A, T, r)$. Let a safety controller $C_t$ for a safety-relevant model $\Sigma_{\mathcal{P}} = (S, U_{A}, F)$ and a specification $S^S_t \subseteq S$, $t = 0, 1, 2, \ldots$, be provided. Then, an agent with a shield designed according to the Algorithm alg

Figures (4)

  • Figure 1: Gridworlds for the distributional shift problem. The left figure is the environment used for training, while the right one is used for deployment.
  • Figure 2: A variant of the classic grid maze. The gate in the left figure is open, reflecting the training environment, while the gate in the right figure is closed but opens at the third time step.
  • Figure 3: Impact of prediction horizon on agent trajectories. The color of the trajectories on the left corresponds to that of the plots on the right, representing different test conditions. Red crosses mean that the agent is stuck in a loop in these grids.
  • Figure 4: Trajectory comparisons between retraining of RL algorithms such as Dyna-Q learning and our approach. The color of the trajectories on the left corresponds to that of the plots on the right, representing different test setups.

Theorems & Definitions (5)

  • Definition III.1
  • Definition IV.1
  • Definition IV.2
  • Definition IV.3
  • Theorem V.1