A Grasp Pose is All You Need: Learning Multi-fingered Grasping with Deep Reinforcement Learning from Vision and Touch
Federico Ceola, Elisa Maiettini, Lorenzo Rosasco, Lorenzo Natale
TL;DR
The paper tackles dexterous multi-fingered grasping with a hand that has many degrees of freedom (DoFs) by introducing G-PAYN, a DRL-based method for the iCub hand that uses automatically collected demonstrations and an externally generated grasp pose to bootstrap learning. It trains a policy with Soft Actor-Critic on a state that fuses RGB vision, tactile sensing, and proprioception, using a reward that encourages both successful grasps and useful intermediate states. Experiments in MuJoCo on five YCB-Video objects show that G-PAYN outperforms the DRL baselines and often surpasses the demonstration pipeline, grasping and lifting objects faster and more reliably. The approach reduces the demonstration burden and lays the groundwork for further speedups and for deployment on the real robot; the code is released for reproducibility.
Abstract
Multi-fingered robotic hands have the potential to enable robots to perform sophisticated manipulation tasks. However, teaching a robot to grasp objects with an anthropomorphic hand is an arduous problem due to the high dimensionality of the state and action spaces. Deep Reinforcement Learning (DRL) offers techniques to design control policies for this kind of problem without explicit environment or hand modeling. However, state-of-the-art model-free algorithms have proven inefficient for learning such policies. The main problem is that exploring the environment is infeasible in such high-dimensional settings, which hampers the initial phases of policy optimization. One way to address this is to rely on off-line task demonstrations, but this is often too demanding in terms of time and computational resources. To address these problems, we propose the A Grasp Pose is All You Need (G-PAYN) method for the anthropomorphic hand of the iCub humanoid. We develop an approach to automatically collect task demonstrations that are used to initialize the training of the policy. The proposed grasping pipeline starts from a grasp pose generated by an external algorithm, which is used to initiate the movement. Then, a control policy (previously trained with the proposed G-PAYN) is used to reach and grasp the object. We deploy the iCub in the MuJoCo simulator and use it to test our approach with objects from the YCB-Video dataset. Results show that, in the considered setting, G-PAYN outperforms the current DRL baselines in terms of both success rate and execution time. The code to reproduce the experiments is released together with the paper under an open source license.
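The idea of initializing policy training from automatically collected demonstrations can be illustrated with a minimal sketch. It assumes an off-policy learner (such as Soft Actor-Critic) that samples from a replay buffer, and it pre-fills that buffer with demonstration transitions before any policy rollout is stored. The names `Transition`, `ReplayBuffer`, `seed_with_demonstrations`, and `toy_demo_episode` are hypothetical and are not taken from the released code; the toy episode is only a stand-in for a scripted grasp that starts from an externally supplied grasp pose.

```python
import random
from collections import deque
from typing import Deque, Iterable, List, NamedTuple, Sequence


class Transition(NamedTuple):
    """One environment step, as stored by an off-policy learner."""
    state: Sequence[float]
    action: Sequence[float]
    reward: float
    next_state: Sequence[float]
    done: bool


class ReplayBuffer:
    """FIFO buffer holding both demonstration and policy transitions."""

    def __init__(self, capacity: int) -> None:
        self._storage: Deque[Transition] = deque(maxlen=capacity)

    def add(self, transition: Transition) -> None:
        self._storage.append(transition)

    def sample(self, batch_size: int, rng: random.Random) -> List[Transition]:
        # Sample without replacement, capped at the current buffer size.
        k = min(batch_size, len(self._storage))
        return rng.sample(list(self._storage), k)

    def __len__(self) -> int:
        return len(self._storage)


def seed_with_demonstrations(
    buffer: ReplayBuffer, demo_episodes: Iterable[List[Transition]]
) -> int:
    """Copy every demonstration transition into the buffer; return the count."""
    added = 0
    for episode in demo_episodes:
        for transition in episode:
            buffer.add(transition)
            added += 1
    return added


def toy_demo_episode(length: int) -> List[Transition]:
    """Hypothetical stand-in for one scripted grasp: reward only at the end."""
    episode = []
    for t in range(length):
        done = t == length - 1
        reward = 1.0 if done else 0.0
        episode.append(Transition([float(t)], [0.0], reward, [float(t + 1)], done))
    return episode


# Pre-fill the buffer with three toy demonstrations, then draw a batch
# exactly as the learner would during its first gradient updates.
buffer = ReplayBuffer(capacity=1000)
n_added = seed_with_demonstrations(buffer, [toy_demo_episode(5) for _ in range(3)])
batch = buffer.sample(4, random.Random(0))
```

Under this assumption, the first gradient updates already see transitions that end in a successful grasp, which is the part of the state space that random exploration rarely reaches in a high-dimensional hand.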
