Conditional Sequence Modeling for Safe Reinforcement Learning
Wensong Bai, Chao Zhang, Qihang Xu, Chufan Chen, Chenhao Zhou, Hui Qian
TL;DR
This paper tackles offline safe reinforcement learning with varying cost budgets by introducing the Return-Cost Regularized Constrained Decision Transformer (RCDT), a conditional sequence modeling approach that enables zero-shot adaptation across multiple cost thresholds. RCDT augments the Constrained Decision Transformer (CDT) with three components: an auto-adaptive Lagrangian-style cost penalty, trajectory-level return-cost-aware reweighting, and Q-value regularization, all while preserving return-to-go (RTG) and cost-to-go (CTG) conditioning to support multi-threshold deployment. The authors provide a theoretical alignment bound that characterizes when RTG/CTG conditioning can reliably realize targeted return-cost profiles as a function of data coverage $\alpha_F$ and horizon $H$, and they show experimentally that RCDT achieves better return-cost trade-offs than representative baselines on the DSRL benchmark across SafetyGym, BulletSafetyGym, and MetaDrive. The results indicate that combining data-aware weighting and value guidance with an adaptive safety penalty yields robust zero-shot safe policies across diverse safety-critical domains, suggesting a practical path toward deployable offline safe RL systems.
Abstract
Offline safe reinforcement learning (RL) aims to learn policies from a fixed dataset while maximizing performance under cumulative cost constraints. In practice, deployment requirements often vary across scenarios, necessitating a single policy that can adapt zero-shot to different cost thresholds. However, most existing offline safe RL methods are trained under a pre-specified threshold, yielding policies with limited generalization and deployment flexibility across cost thresholds. Motivated by recent progress in conditional sequence modeling (CSM), which enables flexible goal-conditioned control by specifying target returns, we propose RCDT, a CSM-based method that supports zero-shot deployment across multiple cost thresholds within a single trained policy. RCDT is the first CSM-based offline safe RL algorithm that integrates a Lagrangian-style cost penalty with an auto-adaptive penalty coefficient. To avoid overly conservative behavior and achieve a more favorable return-cost trade-off, a reward-cost-aware trajectory reweighting mechanism and Q-value regularization are further incorporated. Extensive experiments on the DSRL benchmark demonstrate that RCDT consistently improves return-cost trade-offs over representative baselines, advancing the state-of-the-art in offline safe RL.
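To make the idea of conditioning on target returns and cost thresholds concrete, the following minimal sketch shows how return-to-go and cost-to-go sequences are typically derived from a logged trajectory, and how the two targets could be decremented at deployment time. It is an illustration only, not the RCDT implementation: the function names `to_go` and `step_targets` are hypothetical, and it assumes undiscounted sums and a simple subtract-what-was-observed update, as is common in Decision Transformer style methods.

```python
from typing import List, Sequence, Tuple


def to_go(values: Sequence[float]) -> List[float]:
    """Suffix sums: out[t] = values[t] + values[t+1] + ... + values[-1].

    Applied to per-step rewards this gives returns-to-go; applied to
    per-step costs it gives costs-to-go. Undiscounted by assumption.
    """
    out = [0.0] * len(values)
    running = 0.0
    # Walk backwards so each entry accumulates everything after it.
    for t in range(len(values) - 1, -1, -1):
        running += values[t]
        out[t] = running
    return out


def step_targets(
    rtg: float, ctg: float, reward: float, cost: float
) -> Tuple[float, float]:
    """Hypothetical deployment-time update of the conditioning targets.

    After each environment step, the remaining return target and the
    remaining cost budget are reduced by what was just observed.
    """
    return rtg - reward, ctg - cost


# Toy trajectory (made-up numbers, for illustration only).
rewards = [1.0, 0.5, 2.0]
costs = [0.0, 1.0, 0.0]
rtg_seq = to_go(rewards)  # [3.5, 2.5, 2.0]
ctg_seq = to_go(costs)    # [1.0, 1.0, 0.0]
```

Under this view, deploying one trained policy at a different cost threshold amounts to changing only the initial cost-to-go value fed to the model, with no retraining.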
