RecoveryChaining: Learning Local Recovery Policies for Robust Manipulation

Shivam Vats, Devesh K. Jha, Maxim Likhachev, Oliver Kroemer, Diego Romeres

TL;DR

RecoveryChaining presents a hierarchical reinforcement learning framework that learns robust recovery policies for multi-step manipulation by leveraging a hybrid action space of primitive actions and temporally extended nominal options that transfer control to model-based controllers. The method includes Failure Discovery to gather failure cases in simulation and Recovery Learning to train policies that steer the robot back to a nominal-controller precondition, using Monte-Carlo estimates of task success as rewards; Lazy RecoveryChaining further speeds learning by using high-precision binary classifiers to lazily invoke nominal options. Across pick-and-place, shelf, and cluttered-shelf domains with sparse rewards, RC and especially Lazy RC achieve significantly higher recovery success than baselines, and the learned policies transfer to a physical robot without additional fine-tuning, demonstrating sim-to-real viability. The work highlights the value of combining model-based controllers with learned recoveries, while noting limitations such as dependence on accurate simulators, initiation-set assumptions, and challenges under strong partial observability.
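The Monte-Carlo reward mentioned above can be illustrated with a minimal Python sketch. It assumes that selecting a nominal option is scored by rolling out the remaining nominal controllers several times and taking the fraction of successful rollouts as the reward. The function name `monte_carlo_reward`, the `rollout` callable, and the default rollout count are hypothetical and are not taken from the paper.

```python
from typing import Callable


def monte_carlo_reward(rollout: Callable[[], bool], n_rollouts: int = 10) -> float:
    """Estimate task success probability as the fraction of successful rollouts.

    `rollout` is a hypothetical callable that executes the nominal controllers
    once (e.g., in simulation) and returns True if the task goal was reached.
    """
    if n_rollouts <= 0:
        raise ValueError("n_rollouts must be positive")
    successes = sum(1 for _ in range(n_rollouts) if rollout())
    return successes / n_rollouts


# Toy usage with a fixed sequence of outcomes instead of a simulator.
outcomes = iter([True, False, True, False])
estimate = monte_carlo_reward(lambda: next(outcomes), n_rollouts=4)
```

With stochastic dynamics, averaging over several rollouts gives a smoother learning signal than a single binary outcome, at the cost of more simulation time per nominal-option call.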

Abstract

Model-based planners and controllers are commonly used to solve complex manipulation problems as they can efficiently optimize diverse objectives and generalize to long-horizon tasks. However, they often fail during deployment due to noisy actuation, partial observability and imperfect models. To enable a robot to recover from such failures, we propose to use hierarchical reinforcement learning to learn a recovery policy. The recovery policy is triggered when a failure is detected based on sensory observations and seeks to take the robot to a state from which it can complete the task using the nominal model-based controllers. Our approach, called RecoveryChaining, uses a hybrid action space in which the model-based controllers are provided as additional nominal options. This allows the recovery policy to decide how to recover, when to switch to a nominal controller, and which controller to switch to, even with sparse rewards. We evaluate our approach in three multi-step manipulation tasks with sparse rewards, where it learns significantly more robust recovery policies than those learned by baselines. We successfully transfer recovery policies learned in simulation to a physical robot to demonstrate the feasibility of sim-to-real transfer with our method.

Paper Structure

This paper contains 17 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Due to uncertainty in the grasp of the object, the robot ends up making contact with the shelf during execution, resulting in a collision as well as an in-hand slip failure. However, using the proposed recovery chaining framework, the robot recovers from the collision state and then hands over control to the nominal place skill. The learned recovery also allows the robot to correct the slip by intentionally making contact with the shelf wall. Note that the policy is trained entirely in simulation. [Best viewed in color].
  • Figure 2: We propose an approach to learn robust recovery behaviors on top of given nominal controllers using reinforcement learning that works even with sparse rewards. Here, the robot is trying to place a box on a shelf but accidentally collides with the shelf due to an imprecise grasp. Using our approach, the robot learns a recovery policy from the failure state in a hybrid action space consisting of primitive robot actions and temporally extended nominal options that trigger a sub-sequence of the nominal controllers. The recovery policy is trained to quickly take the robot to the precondition of one of the nominal controllers so that it can transfer control to the nominal controllers to complete the task. Solid arrows indicate actions taken by the robot and dashed arrows indicate other available actions.
  • Figure 3: Representation of a sequence of nominal policies that solve a task specified by a binary function $f_{goal}$. Due to model inaccuracies and stochastic dynamics, the system may deviate from the nominal plan. A failure detector is used to stop the robot before it encounters an irrecoverable failure. However, this state could be outside the preconditions of the nominal policies. Hence, a new recovery policy $\pi^r$ is learned to take the system back on the nominal plan.
  • Figure 4: We use a hybrid action space for reinforcement learning. It consists of both primitive robot actions and nominal options that transfer control to a sequence of nominal policies that can take it to the goal if applied successfully.
  • Figure 5: Comparison of the learning curves of RecoveryChaining (RC), Lazy RC, Pretrained Preconditions (PP) and RL for Recovery (RLR) in the pick-place, shelf and cluttered-shelf domains. RC and Lazy RC make consistent progress in learning, with Lazy RC learning faster in two of the three domains. PP hits a local optimum early in training and cannot improve its policy further because it is limited by a pretrained reward model. PP performs poorly on the shelf task due to its partially observable nature. RLR makes no progress in any of the tasks. Results are averaged over 5 different seeds.
  • ...and 4 more figures
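
The hybrid action space described in the captions of Figures 2 and 4 can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the class names `Primitive` and `NominalOption`, the `step` function, the additive primitive dynamics, and the single-rollout sparse reward are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple, Union

State = Tuple[float, ...]
Controller = Callable[[State], State]


@dataclass(frozen=True)
class Primitive:
    """A low-level robot action, modeled here as a small state displacement."""
    delta: Tuple[float, ...]


@dataclass(frozen=True)
class NominalOption:
    """Hand control to the nominal controllers, starting at `start_index`."""
    start_index: int


Action = Union[Primitive, NominalOption]


def step(state: State, action: Action, controllers: Sequence[Controller],
         goal_reached: Callable[[State], bool]) -> Tuple[State, float, bool]:
    """Apply one hybrid action and return (next_state, reward, done)."""
    if isinstance(action, Primitive):
        # Primitive actions move the robot and give no reward (sparse setting).
        next_state = tuple(s + d for s, d in zip(state, action.delta))
        return next_state, 0.0, False
    # A nominal option runs the remaining nominal controllers in order and
    # ends the episode; the reward is 1 only if the goal is reached.
    for controller in controllers[action.start_index:]:
        state = controller(state)
    return state, float(goal_reached(state)), True


# Toy 1-D task: two nominal controllers, goal is to reach a value of at least 4.
controllers = [lambda s: (s[0] + 1.0,), lambda s: (s[0] * 2.0,)]
goal = lambda s: s[0] >= 4.0

s1, r1, done1 = step((0.0,), Primitive((1.0,)), controllers, goal)
s2, r2, done2 = step(s1, NominalOption(0), controllers, goal)
```

In this toy setup the recovery policy chooses, at every step, between moving with a primitive action and committing to a nominal option, which mirrors the "how to recover, when to switch, and which controller to switch to" decision described in the abstract.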