Uniformly Safe RL with Objective Suppression for Multi-Constraint Safety-Critical Applications
Zihan Zhou, Jonathan Booher, Khashayar Rohanimanesh, Wei Liu, Aleksandr Petiushko, Animesh Garg
TL;DR
This work addresses safety in reinforcement learning. It argues that CMDP-based risk constraints, which hold only in expectation, can miss rare, high-risk states. It introduces Uniformly Constrained MDPs (UCMDPs), which impose constraints on all reachable states, and proposes Objective Suppression, which solves the resulting Lagrangian dual using a state-aware, non-parametric surrogate for the multipliers. The method adapts the optimization objective to balance task reward against safety risk; its gradient resembles a weighted combination of $Q_R^{\pi}$ and $Q_{C_i}^{\pi}$ with context-sensitive weights, and it can be integrated with hierarchical safe RL approaches such as Recovery RL. Empirically, Objective Suppression improves safety in multi-constraint safety domains (e.g., Safe Mujoco-Ant and Safe Bench), including autonomous driving, while preserving task performance, and it transfers to real autonomous fleet data with reductions in collisions and harsh braking.
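To make the "weighted combination with context-sensitive weights" concrete, here is a minimal Python sketch. It is an illustration only, not the paper's formula: the function name `suppressed_objective`, the linear-then-clipped weight rule, and the budget vector `eps` are all assumptions chosen for the example.

```python
from math import prod

def suppressed_objective(q_r, q_c, eps):
    """Hypothetical state-aware blend of a task critic and safety critics.

    q_r : task critic estimate Q_R(s, a) for one state-action pair
    q_c : list of safety critic estimates Q_{C_i}(s, a), one per constraint
    eps : list of per-constraint risk budgets (assumed strictly positive)
    """
    # Assumed weight rule: each risk weight grows linearly with the predicted
    # risk and saturates at 1 once that constraint's budget is reached.
    w_c = [min(max(c / e, 0.0), 1.0) for c, e in zip(q_c, eps)]
    # The task objective is suppressed as any constraint nears its budget.
    w_r = prod(1.0 - w for w in w_c)
    objective = w_r * q_r - sum(w * c for w, c in zip(w_c, q_c))
    return objective, w_r, w_c

# Safe state: no predicted risk, so the task objective is untouched.
print(suppressed_objective(2.0, [0.0, 0.0], [0.1, 0.1]))
# Risky state: the first constraint exceeds its budget, so the reward term vanishes.
print(suppressed_objective(2.0, [0.2, 0.0], [0.1, 0.1]))
```

Under this assumed rule, the weights depend on the critics evaluated at the current state, so the trade-off between reward and risk changes from state to state instead of being fixed by a single global multiplier.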
Abstract
Safe reinforcement learning tasks are common in the real world, yet they remain challenging. The widely adopted CMDP model constrains risks only in expectation, which leaves room for dangerous behaviors in long-tail states. In safety-critical domains, such behaviors could lead to disastrous outcomes. To address this issue, we first describe the problem with a stronger Uniformly Constrained MDP (UCMDP) model, in which we impose constraints on all reachable states; we then propose Objective Suppression, a novel method that adaptively suppresses the task-reward-maximizing objective according to a safety critic, as a solution to the Lagrangian dual of a UCMDP. We benchmark Objective Suppression in two multi-constraint safety domains, including an autonomous driving domain where any incorrect behavior can lead to disastrous consequences. In the driving domain, we evaluate on open-source and proprietary data and test transfer to a real autonomous fleet. Empirically, we demonstrate that our proposed method, when combined with existing safe RL algorithms, can match the task reward achieved by baselines with significantly fewer constraint violations.
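The contrast between the two models can be sketched as follows. The notation here is an assumption made for illustration ($V_R^{\pi}$ and $V_{C_i}^{\pi}$ for the reward and cost value functions, $\epsilon_i$ for the risk budgets, $\mu_0$ for the initial-state distribution, $\mathcal{S}^{\pi}$ for the set of states reachable under $\pi$); the paper's exact definitions may differ.

```latex
% CMDP: each constraint holds only in expectation over initial states
\max_{\pi} \; \mathbb{E}_{s_0 \sim \mu_0}\!\left[ V_R^{\pi}(s_0) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{s_0 \sim \mu_0}\!\left[ V_{C_i}^{\pi}(s_0) \right] \le \epsilon_i,
\qquad i = 1, \dots, k

% UCMDP: each constraint holds at every reachable state
\max_{\pi} \; \mathbb{E}_{s_0 \sim \mu_0}\!\left[ V_R^{\pi}(s_0) \right]
\quad \text{s.t.} \quad
V_{C_i}^{\pi}(s) \le \epsilon_i
\quad \forall s \in \mathcal{S}^{\pi},
\qquad i = 1, \dots, k
```

In the first form, a policy can satisfy the constraint on average while still incurring high risk in rarely visited states. The second form rules this out, and it is why the dual needs one multiplier per state and constraint instead of a single scalar per constraint.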
