Offline Goal-Conditioned Reinforcement Learning for Safety-Critical Tasks with Recovery Policy

Chenyang Cao; Zichen Yan; Renhao Lu; Junbo Tan; Xueqian Wang

TL;DR

This work tackles constrained offline goal-conditioned reinforcement learning by introducing Recovery-based Supervised Learning (RbSL), a two-policy framework that jointly optimizes a goal-reaching policy and a recovery policy to satisfy safety constraints. The method leverages hindsight relabeling, OOD action detection, and cost-aware data processing to train efficiently on offline data, switching between policies via a learned cost-Q value $Q_C(s,a,g)$. Empirical results across four obstacle-rich manipulation tasks show that RbSL achieves higher success rates and lower constraint violations than strong offline GCRL baselines, with robust performance across varying data qualities and a successful sim-to-real deployment on a Panda manipulator. The work provides a practical, scalable approach to safe offline GCRL with real-world impact, and releases code for reproducibility.
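The policy switch described above can be illustrated with a minimal sketch. It assumes the recovery policy takes over whenever the learned cost-Q value $Q_C(s,a,g)$ of the action proposed by the goal-reaching policy exceeds a threshold. The function names, the threshold, and the toy stand-ins below are all hypothetical and are not taken from the released code.

```python
import numpy as np

def select_action(state, goal, goal_policy, recovery_policy, cost_q, threshold):
    """Pick an action and report whether the recovery policy was used.

    Assumption: recovery is triggered when Q_C(s, a, g) of the proposed
    action is above `threshold` (a hypothetical hyperparameter).
    """
    proposed = goal_policy(state, goal)
    if cost_q(state, proposed, goal) > threshold:
        return recovery_policy(state, goal), True
    return proposed, False

# Toy stand-ins for the learned networks (illustration only).
OBSTACLE = np.array([0.5, 0.0])

def toy_goal_policy(state, goal):
    # Step of length 0.1 straight toward the goal.
    direction = goal - state
    return 0.1 * direction / (np.linalg.norm(direction) + 1e-8)

def toy_recovery_policy(state, goal):
    # Step of length 0.1 directly away from the obstacle.
    away = state - OBSTACLE
    return 0.1 * away / (np.linalg.norm(away) + 1e-8)

def toy_cost_q(state, action, goal):
    # Returns 1.0 if the next state lands within 0.2 of the obstacle, else 0.0.
    next_state = state + action
    return float(np.linalg.norm(next_state - OBSTACLE) < 0.2)

goal = np.array([1.0, 0.0])
far_action, far_recovery = select_action(
    np.array([0.0, 0.0]), goal, toy_goal_policy, toy_recovery_policy, toy_cost_q, 0.5
)
near_action, near_recovery = select_action(
    np.array([0.25, 0.0]), goal, toy_goal_policy, toy_recovery_policy, toy_cost_q, 0.5
)
```

In this toy setup the agent far from the obstacle keeps the goal-reaching action, while the agent about to enter the unsafe region falls back to the recovery action. A real system would replace the three toy functions with the trained networks.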

Abstract

Offline goal-conditioned reinforcement learning (GCRL) aims to solve goal-reaching tasks with sparse rewards from an offline dataset. While prior work has demonstrated various approaches for agents to learn near-optimal policies, these methods are limited when dealing with diverse constraints in complex environments, such as safety constraints. Some of these approaches prioritize goal attainment without considering safety, while others focus excessively on safety at the expense of training efficiency. In this paper, we study the problem of constrained offline GCRL and propose a new method called Recovery-based Supervised Learning (RbSL) to accomplish safety-critical tasks with various goals. To evaluate the method's performance, we build a benchmark based on the robot-fetching environment with a randomly positioned obstacle and use expert or random policies to generate an offline dataset. We compare RbSL with three offline GCRL algorithms and one offline safe RL algorithm. Our method outperforms the existing state-of-the-art methods by a large margin. Furthermore, we validate the practicality and effectiveness of RbSL by deploying it on a real Panda manipulator. Code is available at https://github.com/Sunlighted/RbSL.git.

Paper Structure (18 sections, 12 equations, 5 figures, 2 tables, 1 algorithm)

INTRODUCTION
RELATED WORK
Goal-Conditioned Reinforcement Learning
Offline Safe Reinforcement Learning
PRELIMINARIES
Constrained Goal-augmented Markov Decision Process
Offline Constrained GCRL
METHOD
Overview
Supervised Learning
Recovery Policy
Data Processing
EXPERIMENTS
Environment and Experiment Setup
Baselines
...and 3 more sections

Figures (5)

Figure 1: An overview of Recovery-based Supervised Learning (RbSL). RbSL first samples data from the environment and processes them into two datasets. Then, the agent learns a goal-conditioned policy and a recovery policy. In evaluation, we use the cost Q-value to predict an unsafe state and decide which policy to use.
Figure 2: Recovery policy: we illustrate the recovery policy on a robotic push task. The original policy chooses the shortest path to the goal, but the obstacle blocks that path and the attempt fails. In contrast, the recovery policy corrects actions that would enter an unsafe area and plans a safe path to the goal.
Figure 3: Goal-conditioned environments: (a) MuJoCo Gym-Robotics, (b) Panda-Gym, a simulation environment for the real-world experiment, (c) An experiment environment for the Franka Emika Panda robotic arm.
Figure 4: Training curves of PushObstacle, PickAndPlaceObstacle, SlideObstacle in the 0.5-0.5 setting and ReachObstacle in the 0-1 setting. In each figure, the sub-figures show the average discounted return and the average cost return per epoch, where the black dotted line shows the safety constraint limit.
Figure 5: An example execution trajectory of the push task under RbSL, in which the arm navigates around an obstacle to reach the goal.
