A Grasp Pose is All You Need: Learning Multi-fingered Grasping with Deep Reinforcement Learning from Vision and Touch

Federico Ceola; Elisa Maiettini; Lorenzo Rosasco; Lorenzo Natale

TL;DR

The paper addresses dexterous grasping with a multi-fingered hand that has many degrees of freedom (DoFs). It introduces G-PAYN, a DRL-based method for the iCub hand that bootstraps learning from automatically collected demonstrations and an externally generated grasp pose. The policy is trained with Soft Actor-Critic on a state that fuses RGB vision, tactile sensing, and proprioception, with a reward that encourages both successful grasps and useful intermediate states. Experiments in MuJoCo on five YCB-Video objects show that G-PAYN outperforms the DRL baselines and often surpasses the demonstration pipeline, grasping and lifting objects faster and more reliably. The approach reduces the cost of collecting demonstrations and lays the groundwork for further training speedups and real-world deployment. The code is released for reproducibility.

Abstract

Multi-fingered robotic hands have the potential to enable robots to perform sophisticated manipulation tasks. However, teaching a robot to grasp objects with an anthropomorphic hand is an arduous problem due to the high dimensionality of the state and action spaces. Deep Reinforcement Learning (DRL) offers techniques to design control policies for this kind of problem without explicit environment or hand modeling. However, state-of-the-art model-free algorithms have proven inefficient for learning such policies. The main problem is that exploring the environment is infeasible for such high-dimensional problems, which hampers the initial phases of policy optimization. One way to address this is to rely on off-line task demonstrations, but this is often too demanding in terms of time and computational resources. To address these problems, we propose the A Grasp Pose is All You Need (G-PAYN) method for the anthropomorphic hand of the iCub humanoid. We develop an approach to automatically collect task demonstrations to initialize the training of the policy. The proposed grasping pipeline starts from a grasp pose generated by an external algorithm, which is used to initiate the movement. Then a control policy (previously trained with the proposed G-PAYN) is used to reach and grab the object. We deploy the iCub in the MuJoCo simulator and use it to test our approach with objects from the YCB-Video dataset. Results show that G-PAYN outperforms the current DRL baselines in the considered setting in terms of both success rate and execution time. The code to reproduce the experiments is released together with the paper under an open-source license.
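
The abstract states that automatically collected demonstrations are used to initialize policy training. One common way to do this with an off-policy algorithm such as Soft Actor-Critic is to pre-fill the replay buffer with demonstration transitions before learning starts. The sketch below illustrates that idea only. It is an assumption about the mechanism, not the authors' implementation, and every name in it (`ReplayBuffer`, `collect_demonstrations`, `toy_scripted_episode`) is hypothetical. The toy episode stands in for a simulator rollout of a scripted grasp.

```python
import random
from collections import deque
from typing import Callable, Deque, List, Tuple

# (state, action, reward, next_state, done); a hypothetical transition layout.
Transition = Tuple[List[float], List[float], float, List[float], bool]


class ReplayBuffer:
    """Fixed-capacity FIFO buffer of the kind used by off-policy methods."""

    def __init__(self, capacity: int) -> None:
        self.storage: Deque[Transition] = deque(maxlen=capacity)

    def add(self, transition: Transition) -> None:
        self.storage.append(transition)

    def sample(self, batch_size: int, rng: random.Random) -> List[Transition]:
        return rng.sample(list(self.storage), batch_size)

    def __len__(self) -> int:
        return len(self.storage)


def collect_demonstrations(
    run_scripted_episode: Callable[[int], List[Transition]],
    num_episodes: int,
    buffer: ReplayBuffer,
) -> int:
    """Run a scripted (non-learned) controller and store its transitions.

    Returns the number of transitions added to the buffer.
    """
    added = 0
    for episode in range(num_episodes):
        for transition in run_scripted_episode(episode):
            buffer.add(transition)
            added += 1
    return added


def toy_scripted_episode(seed: int) -> List[Transition]:
    """Stand-in for a simulator rollout: drive a 1-D state to 0 in 3 steps."""
    state = [1.0 + 0.1 * seed]
    transitions: List[Transition] = []
    for step in range(3):
        action = [-state[0] / (3 - step)]  # equal-sized steps toward 0
        next_state = [state[0] + action[0]]
        done = step == 2
        reward = 1.0 if done else 0.0  # sparse reward on the final step
        transitions.append((state, action, reward, next_state, done))
        state = next_state
    return transitions


# Seed the buffer before any policy update, then sample a training batch.
buffer = ReplayBuffer(capacity=1000)
num_added = collect_demonstrations(toy_scripted_episode, num_episodes=5, buffer=buffer)
batch = buffer.sample(batch_size=4, rng=random.Random(0))
```

With a buffer seeded this way, the first gradient updates already see transitions that end in a successful episode, which is the motivation the abstract gives for using demonstrations when exploration alone is infeasible.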

Paper Structure (16 sections, 1 equation, 6 figures, 1 table)

INTRODUCTION
RELATED WORK
Multi-fingered Grasping
Deep Reinforcement Learning from Demonstrations
METHODOLOGY
Grasping Pipeline
Grasp pose computation
Grasp execution
Policy Training
EXPERIMENTAL SETUP
Simulated Environment
Training Hyperparameters
RESULTS
Baselines
Discussion
...and 1 more sections

Figures (6)

Figure 1: iCub simulated environment.
Figure 2: Overview of the proposed grasping pipeline. We rely on a Grasp Pose Generator to compute a suitable grasp pose for the considered object. Then, we move the end-effector of the robot to a Pre-Grasp Pose close to the previously generated grasp pose. Finally, we use a DRL Policy to predict Cartesian offsets to move the end-effector toward the object and offsets in the joint space of the fingers to grasp it. We repeat this procedure until the grasp is executed.
Figure 3: (a) iCub hand reference frame. (b) VGN grasp transformation for the iCub hand. We rotate the grasp pose generated for the Franka Emika Panda gripper by $45^\circ$ to obtain the corresponding grasp pose for the iCub hand.
Figure 4: Results. We compare the considered methods for grasping execution on different objects and different grasp pose generators. In the first row we consider the approach based on superquadric modeling proposed in vezzani2017sq for grasp pose generation. In the second row, instead, we use VGN. In each column, we report results for different YCB-Video objects.
Figure 5: Qualitative evaluation. We show examples of our method on two of the five objects considered in the experiments (the 004_sugar_box and the 021_bleach_cleanser) using the approach based on superquadrics for grasp pose generation. The learned policies successfully approach the objects, grasp them, and lift them.
...and 1 more figures
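
The captions of Figures 2 and 3 describe a three-stage pipeline: adapt the generated grasp pose to the hand (a 45° rotation in the VGN case), move to a pre-grasp pose, then repeatedly apply policy-predicted Cartesian and finger-joint offsets until the grasp is executed. The following is a minimal sketch of that control flow under stated assumptions. The rotation axis (the gripper's local z axis), the standoff distance, the three-joint hand, and the hand-written `make_toy_policy` are all hypothetical stand-ins. A trained policy and the real hand kinematics would replace them.

```python
import math
from typing import Callable, List, Tuple

Vec = List[float]
Mat = List[List[float]]
# A policy maps (hand position, finger joints) to
# (Cartesian offset, finger-joint offsets, grasp-executed flag).
Policy = Callable[[Vec, Vec], Tuple[Vec, Vec, bool]]


def rot_z(angle_rad: float) -> Mat:
    """Rotation matrix for a rotation about the z axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(a: Mat, b: Mat) -> Mat:
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def adapt_grasp_orientation(gripper_rot: Mat, angle_deg: float = 45.0) -> Mat:
    """Rotate the gripper orientation about its own z axis.

    The 45 degree angle comes from the Figure 3 caption; the axis is assumed.
    """
    return matmul(gripper_rot, rot_z(math.radians(angle_deg)))


def pre_grasp_position(grasp_pos: Vec, hand_rot: Mat, standoff: float) -> Vec:
    """Back off from the grasp position along the hand's local z axis."""
    z_axis = [hand_rot[i][2] for i in range(3)]
    return [p - standoff * z for p, z in zip(grasp_pos, z_axis)]


def execute_grasp(start_pos: Vec, finger_joints: Vec, policy: Policy,
                  max_steps: int = 50) -> Tuple[Vec, Vec, int]:
    """Apply policy offsets step by step until the policy reports completion."""
    pos, joints = list(start_pos), list(finger_joints)
    for step in range(1, max_steps + 1):
        d_pos, d_joints, done = policy(pos, joints)
        pos = [p + d for p, d in zip(pos, d_pos)]
        joints = [q + d for q, d in zip(joints, d_joints)]
        if done:
            return pos, joints, step
    return pos, joints, max_steps


def make_toy_policy(target: Vec, max_move: float = 0.02) -> Policy:
    """Hand-written stand-in for a learned policy: approach, then close."""
    def policy(pos: Vec, joints: Vec) -> Tuple[Vec, Vec, bool]:
        delta = [t - p for t, p in zip(target, pos)]
        dist = math.sqrt(sum(d * d for d in delta))
        if dist > max_move:
            scale = max_move / dist
            return [d * scale for d in delta], [0.0] * len(joints), False
        # Close enough: take the final step and close the fingers.
        return delta, [0.5] * len(joints), True
    return policy


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
grasp_pos = [0.3, 0.0, 0.1]
hand_rot = adapt_grasp_orientation(identity)
start = pre_grasp_position(grasp_pos, hand_rot, standoff=0.1)
final_pos, final_joints, steps = execute_grasp(
    start, [0.0, 0.0, 0.0], make_toy_policy(grasp_pos))
```

Predicting offsets rather than absolute targets keeps each action small and local to the current hand state, which matches the caption's description of repeating the procedure until the grasp is executed.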
