Conditional Sequence Modeling for Safe Reinforcement Learning

Wensong Bai; Chao Zhang; Qihang Xu; Chufan Chen; Chenhao Zhou; Hui Qian

TL;DR

This paper addresses offline safe reinforcement learning under varying cost budgets. It introduces the Return-Cost Regularized Constrained Decision Transformer (RCDT), a conditional sequence modeling approach that adapts zero-shot across multiple cost thresholds. RCDT augments the Constrained Decision Transformer (CDT) with three components: an auto-adaptive Lagrangian-style cost penalty, trajectory-level return-cost-aware reweighting, and Q-value regularization. It preserves return-to-go (RTG) and cost-to-go (CTG) conditioning to support multi-threshold deployment. The authors provide a theoretical alignment bound that characterizes when RTG/CTG conditioning can reliably realize targeted return-cost profiles, as a function of data coverage $\alpha_F$ and horizon $H$. Experiments on the DSRL benchmark (SafetyGym, BulletSafetyGym, and MetaDrive) show that RCDT achieves better return-cost trade-offs than the baselines. The results indicate that combining data-aware weighting and value guidance with an adaptive safety penalty yields robust zero-shot safe policies across diverse safety-critical domains, suggesting a practical path toward deployable offline safe RL systems.
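To make the RTG/CTG conditioning concrete, here is a minimal sketch of how return-to-go and cost-to-go tokens are commonly computed from one offline trajectory: each is the suffix sum of the per-step rewards or costs. The function names `to_go` and `conditioning_tokens` are hypothetical, and the sketch assumes undiscounted sums; it is not taken from the paper's code.

```python
from typing import List, Tuple


def to_go(values: List[float]) -> List[float]:
    """Suffix sums: out[t] = values[t] + values[t+1] + ... + values[-1]."""
    out = [0.0] * len(values)
    running = 0.0
    # Walk backwards so each step adds exactly one term.
    for t in range(len(values) - 1, -1, -1):
        running += values[t]
        out[t] = running
    return out


def conditioning_tokens(
    rewards: List[float], costs: List[float]
) -> List[Tuple[float, float]]:
    """Pair the return-to-go and cost-to-go at every timestep (assumed layout)."""
    if len(rewards) != len(costs):
        raise ValueError("rewards and costs must have the same length")
    return list(zip(to_go(rewards), to_go(costs)))


if __name__ == "__main__":
    print(conditioning_tokens([1.0, 0.0, 2.0], [0.0, 1.0, 1.0]))
```

At deployment, a conditioned policy of this kind is typically prompted with a target return and a target cost budget in place of the dataset values, which is what allows one trained model to be queried at several cost thresholds.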

Abstract

Offline safe reinforcement learning (RL) aims to learn policies from a fixed dataset while maximizing performance under cumulative cost constraints. In practice, deployment requirements often vary across scenarios, necessitating a single policy that can adapt zero-shot to different cost thresholds. However, most existing offline safe RL methods are trained under a pre-specified threshold, yielding policies with limited generalization and deployment flexibility across cost thresholds. Motivated by recent progress in conditional sequence modeling (CSM), which enables flexible goal-conditioned control by specifying target returns, we propose RCDT, a CSM-based method that supports zero-shot deployment across multiple cost thresholds within a single trained policy. RCDT is the first CSM-based offline safe RL algorithm that integrates a Lagrangian-style cost penalty with an auto-adaptive penalty coefficient. To avoid overly conservative behavior and achieve a more favorable return-cost trade-off, a reward-cost-aware trajectory reweighting mechanism and Q-value regularization are further incorporated. Extensive experiments on the DSRL benchmark demonstrate that RCDT consistently improves return-cost trade-offs over representative baselines, advancing the state-of-the-art in offline safe RL.
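As an illustration of what a Lagrangian-style cost penalty with an adaptive coefficient can look like, the sketch below uses textbook projected dual ascent: the coefficient grows while the observed cost exceeds the threshold and shrinks (never below zero) otherwise. This is an assumption for illustration only; the abstract does not specify RCDT's actual update rule, and `update_penalty`, `penalized_objective`, and the step size `lr` are hypothetical names.

```python
def update_penalty(lam: float, mean_cost: float, threshold: float, lr: float = 0.1) -> float:
    """One projected dual-ascent step on the penalty coefficient (generic, assumed rule)."""
    # Increase lam when the constraint is violated, decrease it otherwise,
    # and project back onto lam >= 0.
    return max(0.0, lam + lr * (mean_cost - threshold))


def penalized_objective(ret: float, cost: float, lam: float, threshold: float) -> float:
    """Lagrangian of 'maximize return subject to cost <= threshold'."""
    return ret - lam * (cost - threshold)


if __name__ == "__main__":
    lam = 0.0
    # Hypothetical per-iteration mean episode costs against a budget of 20.
    for observed_cost in [30.0, 28.0, 22.0, 18.0]:
        lam = update_penalty(lam, observed_cost, threshold=20.0)
        print(f"cost={observed_cost:5.1f}  lam={lam:.2f}")
```

The design point is that the coefficient is not hand-tuned: it rises only as long as the constraint is violated, which is one way to limit the over-conservatism that a fixed, large penalty would cause.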

Paper Structure (24 sections, 2 theorems, 45 equations, 2 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Offline Safe Reinforcement Learning
Conditional Sequence Modeling for Offline RL
Preliminary
Constrained Markov Decision Process
Conditioned Sequence Modeling
Methodology
Conditioned Sequence Modeling in CMDPs
Trajectory-weighted Value-regularized Constrained Decision Transformer
Return-Cost Regularized Constrained Decision Transformer
Training and inference
Experiments
Main Results for RCDT on DSRL Benchmark
Ablation Studies
...and 9 more sections

Key Result

Theorem 1

Consider a finite-horizon CMDP, a behavior policy $\pi_\beta$, and a conditioning function $F : \mathcal{S} \to \mathbb{R}^2$. Suppose Assumption \ref{assump:cmdp-near-det} is satisfied. Let $\pi^{\mathrm{CDT}}_F$ denote the CDT policy in Eq. \ref{eq:cdt-optimal-policy}. Then there exists a universal constant $C >

Figures (2)

Figure 1: Overview of the constrained decision transformer architecture.
Figure 2: Return-cost distributions of offline trajectories on representative tasks. Each point denotes a trajectory, with cumulative cost on the $x$-axis and return on the $y$-axis; colour intensity indicates the trajectory weight $W_{\tau}$ (darker means larger weight) defined in Eq. \ref{eq:w-def}.

Theorems & Definitions (4)

Theorem 1: Joint alignment with respect to the conditioning function in CMDPs
Proposition 1: KL regularization as a special case of trajectory weighting
proof
proof
