OASIS: Conditional Distribution Shaping for Offline Safe Reinforcement Learning

Yihang Yao, Zhepeng Cen, Wenhao Ding, Haohong Lin, Shiqi Liu, Tingnan Zhang, Wenhao Yu, Ding Zhao

TL;DR

This work tackles offline safe RL by addressing Safe Dataset Mismatch (SDM), where imperfect demonstrations bias the learned policy away from the safe, high-reward optimum. It introduces OASIS, a constraint-conditioned diffusion model that shapes the offline data distribution toward a target domain by generating data under a cost constraint, enabling safer and more rewarding policies while remaining compatible with existing offline RL frameworks. The authors provide theoretical bounds on the distribution-shaping error and the constraint-violation risk, and show in experiments on Bullet-Safety-Gym tasks that OASIS achieves superior safe performance and data efficiency, even with limited generated data. They also note limitations, such as longer offline training times and the difficulty of achieving zero constraint violations from imperfect demonstrations, while pointing to the method's practical relevance for real-world safety-critical tasks.

Abstract

Offline safe reinforcement learning (RL) aims to train a policy that satisfies constraints using a pre-collected dataset. Most current methods struggle with the mismatch between imperfect demonstrations and the desired safe and rewarding performance. In this paper, we introduce OASIS (cOnditionAl diStributIon Shaping), a new paradigm in offline safe RL designed to overcome these critical limitations. OASIS utilizes a conditional diffusion model to synthesize offline datasets, thus shaping the data distribution toward a beneficial target domain. Our approach promotes compliance with safety constraints through effective data utilization and regularization techniques that benefit offline safe RL training. Comprehensive evaluations on public benchmarks and varying datasets show that OASIS helps offline safe RL agents achieve high-reward behavior while satisfying the safety constraints, outperforming established baselines. Furthermore, OASIS exhibits high data efficiency and robustness, making it suitable for real-world applications, particularly in tasks where safety is imperative and high-quality demonstrations are scarce.
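
For orientation, the problem described in the abstract is commonly written as a constrained optimization. The form below is the standard constrained-MDP statement rather than an equation copied from the paper; the policy $\pi$ and the cost threshold $\kappa$ are notation assumed here, while $R(\tau)$ and $C(\tau)$ denote the reward and cost return of a trajectory $\tau$, as in the figure captions.

```latex
\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{\tau \sim \pi}\left[ C(\tau) \right] \le \kappa
```

In the offline setting, $\pi$ must be learned from a fixed dataset only, which is why the distribution of that dataset matters so much.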

Paper Structure (27 sections, 2 theorems, 33 equations, 13 figures, 4 tables, 2 algorithms)

This paper contains 27 sections, 2 theorems, 33 equations, 13 figures, 4 tables, 2 algorithms.

Key Result

Theorem 1

Suppose that the optimal stationary state distribution $d^*(s)$ satisfies: 1) its score function $\nabla_{s} \log d^*(s)$ is $L$-Lipschitz, and 2) its second moment is bounded. Under Assumptions ass:score and ass:inverse, the gap between the generated state-action distribution and the optimal stationary state-action distribution is bounded as … where $C(d^*(s), L, K)$ represents a constant determined by $d^*(s)$, $L$, and $K$.

Figures (13)

  • Figure 1: An example of distribution shaping in offline safe RL. We generate a low-cost and high-reward dataset from the original dataset for subsequent RL training.
  • Figure 2: ${\mathcal{D}}_1$ is a conservative dataset, and ${\mathcal{D}}_2$ is a tempting dataset. Each point represents $(C(\tau), R(\tau))$ of a trajectory $\tau$ in the dataset.
  • Figure 3: (a) Reweighting in the dataset with comprehensive coverage. (b) Reweighting in the tempting dataset. (c) Performance evaluation with different weights and datasets.
  • Figure 4: OASIS overview.
  • Figure 5: Performance with different datasets and varying constraint thresholds.
  • ...and 8 more figures
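
The scatter points in Figure 2 and the reweighting studied in Figure 3 can be illustrated with a small sketch. It is not the paper's implementation and not OASIS itself: the function names, the toy dataset, the use of undiscounted sums for $C(\tau)$ and $R(\tau)$, and the hard-threshold weighting rule are all assumptions made for illustration.

```python
import numpy as np

def trajectory_returns(rewards, costs):
    """Return (C(tau), R(tau)) for one trajectory.

    Assumption: both returns are plain undiscounted sums.
    """
    return float(np.sum(costs)), float(np.sum(rewards))

def reweight(cost_returns, threshold, safe_weight=1.0, unsafe_weight=0.0):
    """Hypothetical reweighting rule: normalized sampling weights that
    favor trajectories whose cost return is at most `threshold`."""
    cost_returns = np.asarray(cost_returns, dtype=float)
    w = np.where(cost_returns <= threshold, safe_weight, unsafe_weight)
    total = w.sum()
    if total == 0:
        # No trajectory gets positive weight: fall back to uniform sampling.
        return np.full(len(cost_returns), 1.0 / len(cost_returns))
    return w / total

# Toy dataset of three trajectories (made-up numbers).
dataset = [
    {"rewards": [1.0, 1.0, 1.0], "costs": [0.0, 0.0, 0.0]},
    {"rewards": [2.0, 2.0, 2.0], "costs": [1.0, 1.0, 1.0]},
    {"rewards": [0.5, 0.5, 0.5], "costs": [0.0, 1.0, 0.0]},
]

# One (C(tau), R(tau)) point per trajectory, as plotted in Figure 2.
points = [trajectory_returns(t["rewards"], t["costs"]) for t in dataset]

# Sampling weights under a cost threshold of 1.0.
weights = reweight([c for c, _ in points], threshold=1.0)
```

Reweighting of this kind can only redistribute probability mass over trajectories that already exist in the dataset. Generating new data under a cost condition, as the TL;DR describes for OASIS, is a different approach to the same mismatch.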

Theorems & Definitions (5)

  • Definition 1: Tempting policy [liu2022robustness] and conservative policy
  • Definition 2: Tempting and conservative dataset
  • Theorem 1: Distribution shaping error bound
  • Theorem 2: Constraint violation bound
  • proof