SafeDreamer: Safe Reinforcement Learning with World Models

Weidong Huang, Jiaming Ji, Chunhe Xia, Borong Zhang, Yaodong Yang

TL;DR

SafeDreamer addresses safety in reinforcement learning by combining safety-aware planning inside a world model with a Lagrangian method that balances reward against cost. It introduces online and background planning variants (OSRP, OSRP-Lag, BSRP-Lag) within a DreamerV3-based framework, using the Constrained Cross-Entropy Method (CCEM) to approach zero cost on vision-based tasks. The approach reaches near-zero cost across Safety-Gymnasium benchmarks while achieving competitive rewards, and ablations highlight the importance of world-model fidelity and planning horizon. Code and checkpoints are released to support reproducibility and further research.

Abstract

The deployment of Reinforcement Learning (RL) in real-world applications is constrained by its failure to satisfy safety criteria. Existing Safe Reinforcement Learning (SafeRL) methods, which rely on cost functions to enforce safety, often fail to achieve zero-cost performance in complex scenarios, especially vision-only tasks. These limitations are primarily due to model inaccuracies and inadequate sample efficiency. The integration of the world model has proven effective in mitigating these shortcomings. In this work, we introduce SafeDreamer, a novel algorithm incorporating Lagrangian-based methods into world model planning processes within the superior Dreamer framework. Our method achieves nearly zero-cost performance on various tasks, spanning low-dimensional and vision-only input, within the Safety-Gymnasium benchmark, showcasing its efficacy in balancing performance and safety in RL tasks. Further details can be found in the code repository: https://github.com/PKU-Alignment/SafeDreamer.
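
To make the "Lagrangian-based" balancing of reward and cost concrete, the following is a minimal sketch of a plain dual-ascent Lagrangian scheme. It is an illustration only and is not taken from the SafeDreamer code: the function names, the learning rate `lr`, and the `cost_budget` parameter are hypothetical, and the paper's actual update rule may differ (for example, it may use an augmented or otherwise stabilized variant).

```python
def lagrangian_step(lam, episode_cost, cost_budget, lr=0.01):
    """One dual-ascent step on the Lagrange multiplier (hypothetical helper).

    The multiplier grows while the observed cost exceeds the budget and
    shrinks otherwise; it is clipped at zero so it remains a valid multiplier.
    """
    return max(0.0, lam + lr * (episode_cost - cost_budget))


def penalized_objective(reward_return, cost_return, lam, cost_budget):
    """Scalar objective the policy maximizes for a fixed multiplier.

    Reward return minus the multiplier-weighted constraint violation.
    """
    return reward_return - lam * (cost_return - cost_budget)
```

With a cost budget of zero, any positive cost pushes the multiplier up, which in turn makes costly trajectories less attractive under the penalized objective.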

Paper Structure (43 sections, 14 equations, 27 figures, 9 tables, 3 algorithms)

This paper contains 43 sections, 14 equations, 27 figures, 9 tables, 3 algorithms.

Figures (27)

  • Figure 1: The architecture of SafeDreamer. (a) illustrates all components of SafeDreamer, which treats costs as safety indicators distinct from rewards and balances the two using the Lagrangian method and a safe planner. The OSRP (b) and OSRP-Lag (c) variants perform online safety-reward planning (OSRP) within the world model to generate actions; OSRP-Lag additionally integrates online planning with the Lagrangian approach to balance long-term rewards and costs. The BSRP-Lag variant (d) employs background safety-reward planning (BSRP) via the Lagrangian method within the world model to update a safe actor.
  • Figure 2: Safety-reward planning process. The agent acquires an observation and uses the encoder to distill it into a latent state $s_1$. The agent then generates action trajectories with the policy and executes them within the world model, which predicts latent rollouts of the model state together with a reward, cost, reward value, and cost value for each latent state. We employ TD($\lambda$) [2020dreamerv2] to estimate the reward and cost returns of each trajectory, which are used to update the policy.
  • Figure 3: Experimental results on the five vision tasks for the model-based methods. The results are recorded after the agent completes 2M training steps. We normalize the metrics following [safety_gym] and use the rliable library [agarwal2021deep] to calculate the median, interquartile mean (IQM), and mean estimates for normalized reward and cost returns.
  • Figure 4: Comparing SafeDreamer to model-based baselines across five image-based safety tasks.
  • Figure 5: Results in low-dimensional input tasks.
  • ...and 22 more figures
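
The TD($\lambda$) return estimate mentioned in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the common recursion $R^\lambda_t = r_t + \gamma\big((1-\lambda)\,v_{t+1} + \lambda R^\lambda_{t+1}\big)$ with the final value as bootstrap, and the function name and the default `gamma` and `lam` values are hypothetical. The same routine would apply to cost returns by passing predicted costs and cost values instead of rewards and reward values.

```python
import numpy as np


def lambda_return(rewards, values, gamma=0.997, lam=0.95):
    """TD(lambda) returns along one imagined trajectory (illustrative sketch).

    rewards: shape (H,), predicted reward (or cost) at each step.
    values:  shape (H + 1,), predicted value of each latent state;
             values[H] bootstraps the tail of the trajectory.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    horizon = rewards.shape[0]
    returns = np.empty(horizon)
    next_return = values[horizon]  # bootstrap from the final value estimate
    # Work backwards so each step can reuse the return of its successor.
    for t in reversed(range(horizon)):
        next_return = rewards[t] + gamma * (
            (1.0 - lam) * values[t + 1] + lam * next_return
        )
        returns[t] = next_return
    return returns
```

Setting `lam=0` reduces this to one-step TD targets, while `lam=1` gives the discounted sum of predicted rewards plus the bootstrapped final value.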