Safe Offline Reinforcement Learning with Feasibility-Guided Diffusion Model

Yinan Zheng, Jianxiong Li, Dongjie Yu, Yujie Yang, Shengbo Eben Li, Xianyuan Zhan, Jingjing Liu

TL;DR

This work tackles safe offline RL under hard safety constraints by translating the state-wise zero-violation requirement into the problem of identifying the largest feasible region $\mathcal{S}_f^*$ via Hamilton-Jacobi reachability. It decouples learning into three steps: offline feasibility identification, optimization of a feasibility-dependent reward/safety objective, and diffusion-based policy extraction using energy-guided weighting, which avoids training a time-dependent classifier. The method, FISOR, yields a closed-form optimal policy $\pi^*(a|s) \propto \pi_\beta(a|s) \cdot w(s,a)$ with a region-dependent weight $w(s,a)$, and trains a diffusion model with a simple weighted regression objective to realize this policy. Empirically, FISOR satisfies the safety constraints on all 26 tasks of the DSRL benchmark while achieving state-of-the-art or near-state-of-the-art rewards, and it extends naturally to safe offline imitation learning. By combining explicit feasibility characterization with energy-guided diffusion for policy extraction, the approach offers a stable alternative to soft-constrained or coupled offline methods for safety-critical deployment with offline data.

Abstract

Safe offline RL is a promising way to bypass risky online interactions towards safe policy learning. Most existing methods only enforce soft constraints, i.e., constraining safety violations in expectation below predetermined thresholds. This can lead to potentially unsafe outcomes and is thus unacceptable in safety-critical scenarios. An alternative is to enforce the hard constraint of zero violation. However, this can be challenging in the offline setting, as it needs to strike the right balance among three highly intricate and correlated aspects: safety constraint satisfaction, reward maximization, and behavior regularization imposed by offline datasets. Interestingly, we discover that via reachability analysis from safe-control theory, the hard safety constraint can be equivalently translated to identifying the largest feasible region given the offline dataset. This seamlessly converts the original three-way trade-off into a feasibility-dependent objective, i.e., maximizing reward value within the feasible region while minimizing safety risks in the infeasible region. Inspired by this, we propose FISOR (FeasIbility-guided Safe Offline RL), which allows safety constraint adherence, reward maximization, and offline policy learning to be realized via three decoupled processes, while offering strong safety performance and stability. In FISOR, the optimal policy for the translated optimization problem can be derived in a special form of weighted behavior cloning. Thus, we propose a novel energy-guided diffusion model that does not require training a complicated time-dependent classifier to extract the policy, greatly simplifying the training. We compare FISOR against baselines on the DSRL benchmark for safe offline RL. Evaluation results show that FISOR is the only method that can guarantee safety satisfaction in all tasks, while achieving top returns in most tasks.
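
To make the weighted behavior cloning idea concrete, the following is a minimal NumPy sketch of a region-dependent weight $w(s,a)$ and a weighted noise-regression loss for a diffusion policy. It is an illustration under assumptions, not the authors' implementation: the function names, the temperatures `alpha_r` and `alpha_h`, the exponential-advantage form of the weights, and the DDPM-style noising rule are all hypothetical choices made here. Only the overall shape (a region-dependent weight applied to a simple regression objective, with feasibility read off from the sign of the feasible value) comes from the text above.

```python
import numpy as np

def region_weights(v_h, q_h, adv_r, adv_h, alpha_r=3.0, alpha_h=5.0):
    """Hypothetical region-dependent weight w(s, a).

    Assumed convention: v_h <= 0 marks a feasible state and
    q_h <= 0 marks an action that keeps the agent feasible.
    """
    feasible_state = v_h <= 0.0
    safe_action = q_h <= 0.0
    # Feasible states: favor high reward advantage, but only for safe actions.
    w_feasible = np.exp(alpha_r * adv_r) * safe_action
    # Infeasible states: favor actions that reduce the feasible value.
    w_infeasible = np.exp(-alpha_h * adv_h)
    return np.where(feasible_state, w_feasible, w_infeasible)

def noised_action(action, eps, alpha_bar_t):
    """Assumed DDPM-style forward noising of a dataset action."""
    return np.sqrt(alpha_bar_t) * action + np.sqrt(1.0 - alpha_bar_t) * eps

def weighted_diffusion_loss(eps_pred, eps, weights):
    """Weighted regression: mean of w(s, a) * ||eps - eps_pred||^2."""
    per_sample = np.sum((eps_pred - eps) ** 2, axis=-1)
    return float(np.mean(weights * per_sample))

# Tiny usage example with random stand-ins for learned quantities.
rng = np.random.default_rng(0)
batch, act_dim = 4, 2
actions = rng.normal(size=(batch, act_dim))
eps = rng.normal(size=(batch, act_dim))
a_t = noised_action(actions, eps, alpha_bar_t=0.5)
eps_pred = np.zeros_like(eps)  # placeholder for a noise-prediction network
w = region_weights(
    v_h=rng.normal(size=batch), q_h=rng.normal(size=batch),
    adv_r=rng.normal(size=batch), adv_h=rng.normal(size=batch),
)
print(weighted_diffusion_loss(eps_pred, eps, w))
```

In this sketch the weights are computed once from fixed value estimates, so no classifier has to be trained on noisy actions at each diffusion timestep; the noise-prediction network would take `a_t`, the state, and the timestep as inputs.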

Paper Structure (33 sections, 5 theorems, 49 equations, 12 figures, 6 tables, 1 algorithm)

Key Result

Lemma 1

The optimization objectives and safety constraints in Eq. (eq:feasible_hard_constraint_obj) can be achieved by separate optimization objectives and constraints as follows (see Appendix ap:proof_lemma_trans for proof):

Figures (12)

  • Figure 1: (a) Reach-avoid control task: the agent (red) aims to reach the goal (green) while avoiding hazards (blue). (b) Offline data distribution. (c)-(e) Comparison of the feasible regions learned via the feasible value, $\left\{s|V^{\ast}_h(s) \leq 0 \right\}$, and via the cost value, $\left\{s|V^{\ast}_c(s) \leq 10^{-3} \right\}$. See Appendix ap:toy for more details.
  • Figure 2: Trajectories induced by FISOR from different start points.
  • Figure 3: Soft constraint sensitivity experiments for cost limit $l$ in three environments.
  • Figure 4: Ablations on infeasible objective and diffusion policies (normalized cost).
  • Figure 5: Safe offline IL results in MetaDrive.
  • ...and 7 more figures
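
The feasible region $\left\{s|V^{\ast}_h(s) \leq 0 \right\}$ from the Figure 1 caption can be illustrated on a toy problem. The sketch below is not the paper's experiment: it assumes a hypothetical 1D chain with known deterministic dynamics, a constraint signal `h` that is positive on hazards and negative elsewhere, and an undiscounted fixed-point iteration $V(s) \leftarrow \max\{h(s), \min_a V(s')\}$, which is a standard reachability-style recursion rather than a form confirmed by this summary.

```python
import numpy as np

def feasible_value_iteration(h, moves=(-1, 1), max_iters=1000):
    """Iterate V(s) <- max(h(s), min_a V(s')) on a 1D chain.

    Assumptions (hypothetical toy setup): the agent must move by one of
    `moves` each step, and positions are clipped at the chain ends.
    """
    h = np.asarray(h, dtype=float)
    n = len(h)
    idx = np.arange(n)
    v = h.copy()
    for _ in range(max_iters):
        # Value of the successor state under each available move.
        next_vals = np.stack([v[np.clip(idx + m, 0, n - 1)] for m in moves])
        # Best case over actions, worst case over time.
        v_new = np.maximum(h, next_vals.min(axis=0))
        if np.array_equal(v_new, v):
            break
        v = v_new
    return v

# Hazards at cells 2 and 4; h > 0 means a constraint violation.
h = np.array([-1, -1, 1, -1, 1, -1, -1], dtype=float)
v_h = feasible_value_iteration(h)
feasible = v_h <= 0
print(feasible)
```

Cell 3 is not a hazard itself, yet every available move from it lands on a hazard, so the iteration marks it infeasible. This is the distinction the feasible region captures: it contains the states from which violations can be avoided indefinitely, not merely the states that are currently safe.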

Theorems & Definitions (12)

  • Definition 1: Optimal feasible value function
  • Definition 2: Feasible region
  • Definition 3: Feasible policy set
  • Definition 4: Feasible Bellman operator
  • Lemma 1
  • Theorem 1
  • Theorem 2: Weighted regression as exact energy guidance
  • Lemma 2
  • Proof
  • Proof
  • ...and 2 more
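
For orientation, Definitions 1 and 2 can be read against the standard Hamilton-Jacobi reachability formulation. The block below is a sketch under assumptions: the constraint function $h$ (with $h(s) > 0$ meaning a violation) and the trajectory notation $s_t$ are not given in this summary. Only the sublevel-set form of the feasible region matches the Figure 1 caption and the symbol $\mathcal{S}_f^*$ used in the TL;DR.

$$
V_h^*(s) = \min_{\pi} \max_{t \ge 0} h(s_t), \quad s_0 = s,\ a_t \sim \pi(\cdot \mid s_t), \qquad \mathcal{S}_f^* = \left\{ s \mid V_h^*(s) \le 0 \right\}.
$$

Under this reading, a state is feasible exactly when some policy keeps $h$ non-positive along the entire trajectory that starts from it.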