Boundary-to-Region Supervision for Offline Safe Reinforcement Learning

Huikang Su, Dengyun Peng, Zifeng Zhuang, YuHan Liu, Qiguang Chen, Donglin Wang, Qinghe Liu

TL;DR

Boundary-to-Region (B2R) redefines cost-to-go (CTG) as a boundary constraint under a fixed safety budget, unifying the cost distribution of all feasible trajectories while preserving reward structures. It establishes a new theoretical and practical approach for applying sequence models to safe RL.

Abstract

Offline safe reinforcement learning aims to learn policies that satisfy predefined safety constraints from static datasets. Existing sequence-model-based methods condition action generation on symmetric input tokens for return-to-go and cost-to-go, neglecting their intrinsic asymmetry: return-to-go (RTG) serves as a flexible performance target, while cost-to-go (CTG) should represent a rigid safety boundary. This symmetric conditioning leads to unreliable constraint satisfaction, especially when encountering out-of-distribution cost trajectories. To address this, we propose Boundary-to-Region (B2R), a framework that enables asymmetric conditioning through cost signal realignment. B2R redefines CTG as a boundary constraint under a fixed safety budget, unifying the cost distribution of all feasible trajectories while preserving reward structures. Combined with rotary positional embeddings, it enhances exploration within the safe region. Experimental results show that B2R satisfies safety constraints in 35 out of 38 safety-critical tasks while achieving superior reward performance over baseline methods. This work highlights the limitations of symmetric token conditioning and establishes a new theoretical and practical approach for applying sequence models to safe RL. Our code is available at https://github.com/HuikangSu/B2R.
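The cost signal realignment described in the abstract can be illustrated with a small sketch. This is one plausible reading of the idea, not the authors' implementation: trajectories whose total cost exceeds the safety budget are dropped, and each remaining trajectory's CTG sequence is shifted so that it starts exactly at the budget. The function names (`cost_to_go`, `realigned_ctg`, `filter_and_realign`) and the parameter `budget` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def cost_to_go(costs: Sequence[float]) -> List[float]:
    """Conventional CTG: at step t, the sum of costs from t to the end."""
    ctg: List[float] = []
    running = 0.0
    for c in reversed(costs):
        running += c
        ctg.append(running)
    ctg.reverse()
    return ctg


def realigned_ctg(costs: Sequence[float], budget: float) -> List[float]:
    """Assumed realignment: shift the CTG so it starts at the budget.

    The shift equals the unused budget (budget - total cost), so the
    step-to-step decrements still match the observed per-step costs.
    """
    shift = budget - sum(costs)
    return [value + shift for value in cost_to_go(costs)]


def filter_and_realign(
    trajectories: Sequence[Sequence[float]], budget: float
) -> List[List[float]]:
    """Keep trajectories within the budget and realign their CTG tokens."""
    kept: List[List[float]] = []
    for costs in trajectories:
        if sum(costs) <= budget:  # trajectory filtering (assumed criterion)
            kept.append(realigned_ctg(costs, budget))
    return kept


if __name__ == "__main__":
    per_step_costs = [
        [0, 1, 0, 1],     # total cost 2, safe under a budget of 3
        [1, 1, 1, 1, 1],  # total cost 5, filtered out
        [0, 0, 0],        # total cost 0, safe
    ]
    for ctg in filter_and_realign(per_step_costs, budget=3):
        print(ctg)
```

Under this reading, every retained trajectory presents the same initial CTG token as the one supplied at deployment, which is how sparse boundary-aligned supervision would become supervision over the whole safe region.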

Paper Structure

This paper contains 42 sections, 37 equations, 8 figures, 7 tables, 2 algorithms.

Figures (8)

  • Figure 1: Overview of the B2R framework compared to DT methods. DT approaches rely on boundary-aligned trajectories whose costs happen to match the constraint threshold, making it difficult to supervise diverse safe behaviors and often resulting in unsafe, high-cost actions. To address this, the B2R pipeline introduces Trajectory Filtering to remove unsafe samples and CTG Realignment to align all remaining trajectories with the deployment-time cost threshold. This transforms sparse boundary supervision into consistent training over a broader Safe Region, reducing expected cost (1.5). In contrast, DT methods lack such filtering and alignment, frequently generating actions beyond the constraint, leading to higher expected cost (2.0), as shown in the right subfigure.
  • Figure 2: Velocity profiles in a simplified MetaDrive scenario. Training on boundary-aligned trajectories results in unstable behavior and frequent violations (Non-ideal_V), while B2R achieves smooth, constraint-compliant control.
  • Figure 3: Supervision strategy comparison. Conventional methods (a, b) rely on sparse, boundary-aligned trajectories. In contrast, B2R (c) realigns all safe trajectories to the constraint threshold (dashed line), transforming sparse boundary data into dense, region-wide supervision. Orange dots denote compliant trajectories; dashed arrows show the realignment.
  • Figure 4: Architecture of the Transformer model in the B2R framework. The model takes tokenized inputs consisting of states, actions, RTG, and CTG, and augments them with RoPE for improved temporal modeling.
  • Figure 5: Average B2R performance on BulletSafetyGym (BS), SafetyGymnasium (SG), and MetaDrive (MD) across three constraint levels (L1-L3). Tighter constraints generally decrease cost while rewards remain stable or improve. The non-monotonic cost trend in MD is likely an artifact of L1 filtering bias (see the metrics appendix for thresholds).
  • ...and 3 more figures
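The rotary positional embedding (RoPE) mentioned in the Figure 4 caption can be sketched as follows. This is a generic, minimal RoPE in pure Python, not the B2R model code: it assumes the standard formulation in which consecutive pairs of vector components are rotated by a position-dependent angle, with a frequency base of 10000. The function name `rope` and the helper `dot` are hypothetical.

```python
import math
from typing import List, Sequence


def rope(vec: Sequence[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate consecutive component pairs of `vec` by position-dependent angles."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even embedding dimension")
    out: List[float] = []
    for i in range(0, d, 2):
        # Pair index is i / 2, so the frequency is base ** (-2 * (i / 2) / d).
        theta = position * base ** (-i / d)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product, used here to stand in for an attention score."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0, 4.0]
    k = [0.5, -1.0, 2.0, 0.25]
    # Both pairs of positions have the same offset (2), so the scores agree
    # up to floating-point error.
    print(dot(rope(q, 5), rope(k, 3)))
    print(dot(rope(q, 12), rope(k, 10)))
```

The property this demonstrates is that the score between a rotated query and a rotated key depends only on their relative offset, which is the usual motivation for using RoPE in temporal sequence models.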