Constraint-Conditioned Policy Optimization for Versatile Safe Reinforcement Learning
Yihang Yao, Zuxin Liu, Zhepeng Cen, Jiacheng Zhu, Wenhao Yu, Tingnan Zhang, Ding Zhao
TL;DR
This work tackles versatile safe reinforcement learning: adapting to varying safety thresholds without retraining. It introduces CCPO, a framework combining Versatile Value Estimation (VVE) and Conditioned Variational Inference (CVI), which enables zero-shot generalization to unseen constraint thresholds while maintaining safety and task performance. Theoretical analysis bounds safety violations and gives epsilon-sample efficiency guarantees, and empirical results across multiple tasks show that CCPO outperforms baselines, especially in high-dimensional settings where other methods fail to generalize. The approach offers a practical path toward deployable safe RL in dynamic environments; remaining directions include improving computational efficiency and broadening applicability.
Abstract
Safe reinforcement learning (RL) focuses on training reward-maximizing agents subject to pre-defined safety constraints. Yet, learning versatile safe policies that can adapt to varying safety constraint requirements during deployment without retraining remains a largely unexplored and challenging area. In this work, we formulate the versatile safe RL problem and consider two primary requirements: training efficiency and zero-shot adaptation capability. To address them, we introduce the Conditioned Constrained Policy Optimization (CCPO) framework, consisting of two key modules: (1) Versatile Value Estimation (VVE) for approximating value functions under unseen threshold conditions, and (2) Conditioned Variational Inference (CVI) for encoding arbitrary constraint thresholds during policy optimization. Our extensive experiments demonstrate that CCPO outperforms the baselines in terms of safety and task performance while data-efficiently preserving zero-shot adaptation to different constraint thresholds. This makes our approach suitable for real-world dynamic applications.
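For readers who want the problem statement in symbols, the following is a minimal sketch of how the versatile safe RL problem described in the abstract could be written. The notation is an assumption made for illustration and is not taken from the text above: \epsilon denotes a constraint threshold, \mathcal{E} the set of thresholds of interest, r and c the reward and cost functions, \gamma a discount factor, and \pi(a \mid s, \epsilon) a threshold-conditioned policy.

```latex
% Standard constrained-MDP objective for a single, fixed threshold \epsilon
\pi^{*} = \arg\max_{\pi} J_r(\pi)
\quad \text{s.t.} \quad J_c(\pi) \le \epsilon,
\qquad
J_f(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} f(s_t, a_t)\Big],
\quad f \in \{r, c\}

% Versatile variant (assumed form): one threshold-conditioned policy
% that should satisfy the constraint for every threshold in \mathcal{E}
\pi^{*}(\cdot \mid \cdot, \epsilon)
= \arg\max_{\pi(\cdot \mid \cdot, \epsilon)} J_r\big(\pi(\cdot \mid \cdot, \epsilon)\big)
\quad \text{s.t.} \quad J_c\big(\pi(\cdot \mid \cdot, \epsilon)\big) \le \epsilon,
\qquad \forall\, \epsilon \in \mathcal{E}
```

Under this reading, zero-shot adaptation means evaluating the same conditioned policy at a threshold \epsilon that was not used during training, rather than solving a new constrained problem for each threshold.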
