Offline Goal-Conditioned Reinforcement Learning for Safety-Critical Tasks with Recovery Policy
Chenyang Cao, Zichen Yan, Renhao Lu, Junbo Tan, Xueqian Wang
TL;DR
This work tackles constrained offline goal-conditioned reinforcement learning by introducing Recovery-based Supervised Learning (RbSL), a two-policy framework that jointly optimizes a goal-reaching policy and a recovery policy to satisfy safety constraints. The method leverages hindsight relabeling, OOD action detection, and cost-aware data processing to train efficiently on offline data, switching between policies via a learned cost-Q value $Q_C(s,a,g)$. Empirical results across four obstacle-rich manipulation tasks show that RbSL achieves higher success rates and lower constraint violations than strong offline GCRL baselines, with robust performance across varying data qualities and a successful sim-to-real deployment on a Panda manipulator. The work provides a practical, scalable approach to safe offline GCRL with real-world impact, and releases code for reproducibility.
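The policy switch described above can be illustrated with a minimal sketch. It assumes that the recovery policy takes over whenever the learned cost-Q value $Q_C(s,a,g)$ of the goal-reaching policy's proposed action exceeds a threshold. The threshold value, the function names, and the toy 1-D policies are all hypothetical stand-ins for illustration, not the authors' implementation.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Policy = Callable[[Vector, Vector], Vector]          # (state, goal) -> action
CostQ = Callable[[Vector, Vector, Vector], float]    # (state, action, goal) -> Q_C


def select_action(
    state: Vector,
    goal: Vector,
    goal_policy: Policy,
    recovery_policy: Policy,
    cost_q: CostQ,
    threshold: float = 0.5,  # assumed switching threshold, not from the paper
) -> Tuple[Vector, bool]:
    """Return (action, used_recovery) under an assumed threshold rule."""
    proposed = goal_policy(state, goal)
    # If the proposed action looks too costly, defer to the recovery policy.
    if cost_q(state, proposed, goal) > threshold:
        return recovery_policy(state, goal), True
    return proposed, False


# Hypothetical 1-D stand-ins: an obstacle sits at x = 0.5.
def toy_goal_policy(state: Vector, goal: Vector) -> Vector:
    # Step toward the goal with a fixed step size.
    return [0.1 if goal[0] > state[0] else -0.1]


def toy_recovery_policy(state: Vector, goal: Vector) -> Vector:
    # Always back away in the negative direction.
    return [-0.1]


def toy_cost_q(state: Vector, action: Vector, goal: Vector) -> float:
    # Cost 1 if the next position lands within 0.05 of the obstacle, else 0.
    next_x = state[0] + action[0]
    return 1.0 if abs(next_x - 0.5) < 0.05 else 0.0
```

In this toy setup, a state far from the obstacle keeps the goal-reaching action, while a state whose next step would land near the obstacle triggers the recovery action. In practice the two policies and $Q_C$ would be networks trained on the offline dataset.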
Abstract
Offline goal-conditioned reinforcement learning (GCRL) aims to solve goal-reaching tasks with sparse rewards from an offline dataset. While prior work has demonstrated various approaches for agents to learn near-optimal policies, these methods encounter limitations when dealing with diverse constraints in complex environments, such as safety constraints. Some of these approaches prioritize goal attainment without considering safety, while others focus excessively on safety at the expense of training efficiency. In this paper, we study the problem of constrained offline GCRL and propose a new method called Recovery-based Supervised Learning (RbSL) to accomplish safety-critical tasks with various goals. To evaluate its performance, we build a benchmark based on the robot-fetching environment with a randomly positioned obstacle and use expert or random policies to generate an offline dataset. We compare RbSL with three offline GCRL algorithms and one offline safe RL algorithm. RbSL outperforms these existing state-of-the-art methods by a large margin. Furthermore, we validate the practicality and effectiveness of RbSL by deploying it on a real Panda manipulator. Code is available at https://github.com/Sunlighted/RbSL.git.
