RESPRECT: Speeding-up Multi-fingered Grasping with Residual Reinforcement Learning

Federico Ceola, Lorenzo Rosasco, Lorenzo Natale

TL;DR

RESPRECT tackles the data inefficiency of dexterous grasping with multi-fingered hands by introducing Residual Reinforcement Learning on top of a pre-trained DRL policy. The method trains a residual policy that adds to the pre-trained action, with residual critics initialized from the base policy to speed up learning, enabling about a 5x speed-up and eliminating task demonstrations. It demonstrates strong performance in MuJoCo-iCub simulations and real-robot experiments on the iCub, achieving comparable success to G-PAYN with significantly fewer timesteps and making real-world learning feasible. The approach combines visually rich MAE-based features, tactile sensing, and proprioception, and shows practical potential for rapid adaptation to unseen objects in dexterous manipulation. This work advances fast, demonstration-free adaptation for complex robotic grasping and highlights remaining areas for improvement in failure-reactivity and object pose tracking.

Abstract

Deep Reinforcement Learning (DRL) has proven effective for learning control policies with robotic grippers, but it is much less practical for grasping with dexterous hands, especially on real robotic platforms, because of the high dimensionality of the problem. In this work, we focus on the multi-fingered grasping task with the anthropomorphic hand of the iCub humanoid. We propose the RESidual learning with PREtrained CriTics (RESPRECT) method that, starting from a policy pre-trained on a large set of objects, can learn a residual policy to grasp a novel object in a fraction ($\sim 5 \times$ faster) of the timesteps required to train a policy from scratch, without requiring any task demonstration. To our knowledge, this is the first Residual Reinforcement Learning (RRL) approach that learns a residual policy on top of another policy pre-trained with DRL. We reuse some components of the pre-trained policy during residual learning to further speed up training. We benchmark our results in the iCub simulated environment, and we show that RESPRECT can be effectively used to learn a multi-fingered grasping policy on the real iCub robot. The code to reproduce the experiments is released together with the paper under an open-source license.
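To make the residual idea concrete, the following is a minimal sketch of how a residual action can be added on top of a frozen pre-trained action, with the residual policy receiving the state concatenated with the pre-trained action. It is not the released implementation: the function names, `ACTION_DIM`, the state size, the scaling factors and the toy linear-tanh policies are all hypothetical placeholders chosen for illustration.

```python
import numpy as np

ACTION_DIM = 4  # hypothetical size; the real action holds Cartesian and finger joint offsets


def pretrained_policy(state):
    """Hypothetical stand-in for the frozen pre-trained actor (not trained further)."""
    return 0.1 * np.tanh(state[:ACTION_DIM])


def residual_policy(residual_state, weights):
    """Hypothetical stand-in for the trainable residual actor (toy linear-tanh map)."""
    return 0.05 * np.tanh(residual_state @ weights)


def combined_action(state, weights):
    """Return the executed action plus its two components."""
    a_pre = pretrained_policy(state)
    # The residual actor sees the state concatenated with the pre-trained action.
    s_rl = np.concatenate([state, a_pre])
    a_rl = residual_policy(s_rl, weights)
    # The executed action is the pre-trained action corrected by the residual.
    return a_pre + a_rl, a_pre, a_rl


state = np.linspace(-1.0, 1.0, 6)       # toy 6-dimensional state
weights = np.zeros((6 + ACTION_DIM, ACTION_DIM))
action, a_pre, a_rl = combined_action(state, weights)
```

In this toy setup, zero residual weights give a zero residual, so the combined action starts out equal to the pre-trained one and learning only has to find a correction for the new object.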

Paper Structure (13 sections, 10 figures, 1 table)

Figures (10)

  • Figure 1: RESPRECT overview. We compute state $s_t$ from RGB images at timesteps $t$, $t-1$ and $t-2$ (processed through the MAE in radosavovic2023real and combined with Flare shang2021reinforcement), end-effector Cartesian pose, tactile information and finger joint poses. We compute action $a_t$ (composed of Cartesian offsets for the end-effector and finger joint offsets) by combining the outputs $a_{PRE, t}$ of the pre-trained policy and $a_{RL, t}$ of the residual policy. Note that $a_{RL, t}$ is the output of the residual policy Actor, given the concatenation of $s_t$ and $a_{PRE, t}$ into $s_{RL, t}$. We train only the two $2048$-dimensional fully connected layers in the residual Actor and Critics. For the latter, we start from the Critics weights of the pre-trained policy (orange outline). For the sake of clarity, we do not show the input of the Critics in the pre-trained policy or the outputs of either set of Critics, which are the same as those in SAC haarnoja2018soft.
  • Figure 2: We compare the success rate obtained with different visual backbones (MAE and CLIP) when learning with G-PAYN ceola2023gpayn on the MSO dataset for $2M$ timesteps. We report in separate plots the cases in which we use Superquadrics and VGN for the initial grasp pose synthesis. We also report the average success rate of the Demonstrations used to initialize the G-PAYN replay buffer. Note that they slightly differ for CLIP and MAE due to the random initialization of each episode to collect demonstrations.
  • Figure 3: Results. We compare the success rate achieved by RESPRECT to the baselines for $1M$ environment timesteps. We benchmark the performance over seven YCB-Video objects (on different columns) starting from grasp poses generated either by Superquadrics or VGN (on different rows).
  • Figure 4: Qualitative evaluation of the proposed RESPRECT. We compare it to the pre-trained policy in the same experiment and show how the residual output of RESPRECT allows the task to be solved.
  • Figure 5: RESPRECT success rate (averaged over the last $30$ training episodes) for increasing training time on the real iCub robot.
  • ...and 5 more figures
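The critic warm start described in the Figure 1 caption can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' code: it assumes the residual Critics have the same input dimensionality as the pre-trained ones so that weights can be copied directly, it uses plain NumPy dictionaries in place of a deep learning framework, and the helper names (`init_critic`, `q_value`, `warm_start_critics`) are hypothetical.

```python
import copy

import numpy as np

HIDDEN = 2048  # width of the two fully connected layers named in the Figure 1 caption


def init_critic(in_dim, rng, hidden=HIDDEN):
    """Randomly initialise a critic: two hidden layers and a scalar Q-value output."""
    sizes = [in_dim, hidden, hidden, 1]
    return [
        {"W": rng.normal(0.0, 0.01, size=(m, n)), "b": np.zeros(n)}
        for m, n in zip(sizes[:-1], sizes[1:])
    ]


def q_value(params, x):
    """Forward pass with ReLU hidden activations and a linear output."""
    h = x
    for layer in params[:-1]:
        h = np.maximum(h @ layer["W"] + layer["b"], 0.0)
    out = params[-1]
    return (h @ out["W"] + out["b"])[0]


def warm_start_critics(pretrained_critics):
    """Residual critics start as independent copies of the pre-trained critics.

    Assumption: both sets of critics share the same input size, so the copied
    weights can be used without reshaping.
    """
    return [copy.deepcopy(critic) for critic in pretrained_critics]


rng = np.random.default_rng(0)
# SAC uses two Q-function critics; a small hidden size keeps this demo light.
pretrained = [init_critic(in_dim=16, rng=rng, hidden=32) for _ in range(2)]
residual = warm_start_critics(pretrained)
```

Because the copies are deep, updating the residual critics during training leaves the pre-trained critics unchanged, while the residual critics begin with the pre-trained value estimates rather than random ones.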