Safe Reinforcement Learning with Learned Non-Markovian Safety Constraints

Siow Meng Low, Akshat Kumar

TL;DR

This paper tackles safe reinforcement learning under non-Markovian safety constraints when explicit safety costs are unknown. It introduces a trajectory-based safety model trained on labeled data to capture history-dependent safety and integrates it into an RL-as-inference framework, resulting in the SafeSAC-H algorithm that uses a history-conditioned policy and dual optimization with a dynamically adapted Lagrange multiplier. By recasting the problem with a probabilistic graphical model and variational inference, the authors derive a practical, off-policy method that jointly optimizes reward and safety through two critics and a safety term, with an automated mechanism to enforce constraint satisfaction. Empirical results across MuJoCo, Bullet Safety Gym, and RDDL Gym demonstrate that SafeSAC-H achieves high returns while meeting sophisticated non-Markovian safety constraints, outperforming unconstrained baselines and ablations that remove history or dynamic lambda adaptation.

Abstract

In safe Reinforcement Learning (RL), safety cost is typically defined as a function of the immediate state and action. In practice, safety constraints can often be non-Markovian due to the insufficient fidelity of the state representation, and the safety cost may not be known. We therefore address a general setting where safety labels (e.g., safe or unsafe) are associated with state-action trajectories. Our key contributions are as follows. First, we design a safety model that specifically performs credit assignment to assess the contributions of partial state-action trajectories to safety. This safety model is trained using a labeled safety dataset. Second, using an RL-as-inference strategy, we derive an effective algorithm for optimizing a safe policy using the learned safety model. Finally, we devise a method to dynamically adapt the tradeoff coefficient between reward maximization and safety compliance: we rewrite the constrained optimization problem into its dual problem and derive a gradient-based method to adjust the tradeoff coefficient during training. Our empirical results demonstrate that this approach is highly scalable and able to satisfy sophisticated non-Markovian safety constraints.
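
The gradient-based adjustment of the tradeoff coefficient mentioned in the abstract can be illustrated with a generic dual-ascent step. The sketch below is not the paper's update rule; it is a minimal textbook example for a problem of the form "maximize reward subject to safety >= threshold". The names update_multiplier, penalized_objective, safety_threshold, and step_size are hypothetical and chosen only for illustration.

```python
def update_multiplier(lam, safety_estimate, safety_threshold, step_size=0.05):
    """One projected gradient step on the tradeoff coefficient (dual variable).

    Assumed Lagrangian (illustrative, not taken from the paper):
        L(policy, lam) = reward + lam * (safety - safety_threshold)
    The dual problem minimizes L over lam >= 0, so lam moves against the
    gradient dL/dlam = safety - safety_threshold and is clipped at zero.
    """
    violation = safety_threshold - safety_estimate  # positive when unsafe
    return max(0.0, lam + step_size * violation)


def penalized_objective(reward_estimate, safety_estimate, lam):
    """Scalar objective the policy would maximize for a fixed lam."""
    return reward_estimate + lam * safety_estimate


if __name__ == "__main__":
    lam = 1.0
    # Hypothetical safety estimates observed over successive training steps.
    for safety in [0.50, 0.70, 0.85, 0.95, 0.99]:
        lam = update_multiplier(lam, safety, safety_threshold=0.9)
        print(f"safety={safety:.2f} -> lam={lam:.4f}")
```

Under these assumptions, the coefficient grows while the estimated safety is below the threshold, which shifts the policy objective toward safety, and shrinks toward zero once the constraint is met.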

Paper Structure (30 sections, 27 equations, 7 figures, 1 table, 1 algorithm)

Figures (7)

  • Figure 1: Non-Markovian Safety Model
  • Figure 2: Graphical Model with two sets of optimality variables
  • Figure 3: MuJoCo Experiments - Reward and Safety Performance
  • Figure 4: Safety Bullet Gym Experiments - Reward and Safety Performance
  • Figure 5: MuJoCo Results - Reward and Safety Performance
  • ...and 2 more figures