GAS: Enhancing Reward-Cost Balance of Generative Model-assisted Offline Safe RL

Zifan Liu; Xinran Li; Shibo Chen; Jun Zhang

TL;DR

The paper tackles offline safe RL by addressing the two main limitations of generative-model-based approaches: limited trajectory stitching and imbalanced reward-cost optimization. It introduces Goal-Assisted Stitching (GAS), a framework that augments and relabels offline data at the transition level, learns optimal achievable reward and cost goals via expectile regression, and reshapes data distributions to stabilize training. GAS leverages temporal segmented returns and transition-level relabeling to guide policy optimization through a constrained, goal-conditioned backbone, enabling robust safe performance with improved reward. Empirically, GAS achieves superior reward-cost tradeoffs, stronger safety under tight constraints, zero-shot adaptability to different constraint thresholds, and robustness to imbalanced target rewards and costs on Bullet-safety-gym and Safety-gymnasium benchmarks.

Abstract

Offline Safe Reinforcement Learning (OSRL) aims to learn a policy to achieve high performance in sequential decision-making while satisfying constraints, using only pre-collected datasets. Recent works, inspired by the strong capabilities of Generative Models (GMs), reformulate decision-making in OSRL as a conditional generative process, where GMs generate desirable actions conditioned on predefined reward and cost values. However, GM-assisted methods face two major challenges in OSRL: (1) lacking the ability to "stitch" optimal transitions from suboptimal trajectories within the dataset, and (2) struggling to balance reward targets with cost targets, particularly when they conflict. To address these issues, we propose Goal-Assisted Stitching (GAS), a novel algorithm designed to enhance stitching capabilities while effectively balancing reward maximization and constraint satisfaction. To enhance the stitching ability, GAS first augments and relabels the dataset at the transition level, enabling the construction of high-quality trajectories from suboptimal ones. GAS also introduces novel goal functions, which estimate the optimal achievable reward and cost goals from the dataset. These goal functions, trained using expectile regression on the relabeled and augmented dataset, allow GAS to accommodate a broader range of reward-cost return pairs and achieve a better tradeoff between reward maximization and constraint satisfaction compared to human-specified values. The estimated goals then guide policy training, ensuring robust performance under constrained settings. Furthermore, to improve training stability and efficiency, we reshape the dataset to achieve a more uniform reward-cost return distribution. Empirical results validate the effectiveness of GAS, demonstrating superior performance in balancing reward maximization and constraint satisfaction compared to existing methods.
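The abstract names expectile regression as the tool for estimating optimal achievable goals. The following is a minimal sketch of the generic expectile (asymmetric squared) loss, not the GAS implementation: the function names `expectile_loss` and `fit_expectile`, the scalar estimate, the sample values, and the choice `tau=0.9` are all illustrative assumptions. In GAS the goal functions are presumably learned networks conditioned on states and targets, whereas here a single scalar is fitted to a handful of returns to show how an expectile above 0.5 pulls the estimate toward the best returns seen in the data.

```python
import numpy as np

def expectile_loss(pred, target, tau=0.9):
    """Asymmetric squared loss: errors where target > pred are weighted by tau,
    the others by (1 - tau). tau = 0.5 recovers ordinary mean squared error
    up to a constant factor."""
    diff = target - pred
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return float(np.mean(weight * diff ** 2))

def fit_expectile(samples, tau=0.9, lr=0.1, steps=2000):
    """Fit a single scalar to `samples` by gradient descent on the expectile loss.
    (Hypothetical helper for illustration; a learned goal function would
    replace the scalar with a network output.)"""
    g = float(np.mean(samples))
    for _ in range(steps):
        diff = samples - g
        weight = np.where(diff > 0, tau, 1.0 - tau)
        grad = -2.0 * np.mean(weight * diff)  # derivative of the loss w.r.t. g
        g -= lr * grad
    return g

# Illustrative returns of several trajectories passing through the same state.
returns = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
mean_estimate = fit_expectile(returns, tau=0.5)   # coincides with the sample mean
upper_estimate = fit_expectile(returns, tau=0.9)  # lies between the mean and the maximum
```

With `tau` close to 1 the estimate approaches the largest return in the samples, which is why expectile regression is a common way to approximate a maximum over in-dataset outcomes without querying out-of-distribution actions.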

Paper Structure (34 sections, 25 equations, 11 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Background
Motivation
Insufficient Trajectory Stitching Capabilities
Inability to Balance Reward Maximization and Constraint Satisfaction
Goal-Assisted Stitching
Optimal Achievable Goals
Temporal Segmented Return Augmentation
Transition-level Return Relabeling
Goal Functions with Expectile Regressions
Dataset Reshaping
Experiment
Can GAS achieve both safe and better performance with improved stitching ability?
Can GAS preserve the zero-shot adaptation ability to different constraint thresholds?
...and 19 more sections

Figures (11)

Figure 1: Training curves of CDT with different memory lengths $K$ on the CarCircle task.
Figure 2: Overall view of GAS. Left: Comparison between CDT and GAS. CDT optimizes at a trajectory level, while GAS enables fine-grained trajectory stitching under the guidance of reward maximization and constraint satisfaction. Right: GAS's stitching mechanism, where the goal function learns the optimal reward and cost return-to-go within the constraint given any target. The policy aims to take actions to achieve optimal goals estimated by the goal function via constrained AWR.
Figure 3: Necessity of temporal segmented return augmentation.
Figure 4: Original and reshaped dataset distribution.
Figure 5: Evaluation results on zero-shot adaptation. The x-axis indicates the selected constraint thresholds and the y-axis indicates the corresponding cumulative rewards and costs. The "Ideal" line indicates the case where the cumulative cost equals the constraint threshold.
...and 6 more figures
