Leveraging Pretrained Latent Representations for Few-Shot Imitation Learning on a Dexterous Robotic Hand
Davide Liconti, Yasunori Toshimitsu, Robert Katzschmann
TL;DR
We address the challenge of dexterous hand imitation learning by leveraging latent motion representations learned from large-scale, task-agnostic human-hand datasets. The method trains a reconstruction-based VAE to encode short-horizon hand sub-trajectories into a latent space, then learns a transformer-based behavior cloning policy that predicts latent codes from observations and decodes them into action sequences. Fingertip IK retargeting maps human hand data to a 23-DOF robot, so demonstrations can be collected without teleoperation. The approach reduces the data burden, improves robustness to perceptual and proprioceptive noise, and transfers successfully to a real-world 23-DOF dexterous hand on tasks such as grasping and manipulation. Empirical results in simulation and on hardware show faster convergence and greater resilience to noise, highlighting the value of latent-space representations for robust, data-efficient imitation learning in manipulation tasks.
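To make the latent sub-trajectory idea concrete, the following is a minimal NumPy sketch of the two ingredients named above: slicing a hand trajectory into short-horizon windows, and evaluating a reconstruction-plus-KL VAE objective on them. It is an illustration under stated assumptions, not the authors' implementation: the linear encoder and decoder, the overlapping windows, the horizon of 8 steps, the latent size of 16, and the names `sliding_windows` and `vae_loss` are all hypothetical. Only the 23-dimensional action space is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sliding_windows(traj, horizon):
    """Split a (T, D) trajectory into overlapping (horizon, D) sub-trajectories."""
    T = traj.shape[0]
    return np.stack([traj[t:t + horizon] for t in range(T - horizon + 1)])

def vae_loss(x, W_enc, W_dec, beta=1.0):
    """Negative ELBO of a linear Gaussian VAE (assumed architecture).

    x:     (N, H*D) flattened sub-trajectories
    W_enc: (H*D, 2*latent_dim) maps x to [mu, log_var]
    W_dec: (latent_dim, H*D) maps a latent code back to a sub-trajectory
    """
    latent_dim = W_dec.shape[0]
    stats = x @ W_enc
    mu, log_var = stats[:, :latent_dim], stats[:, latent_dim:]
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps          # reparameterisation trick
    x_hat = z @ W_dec                             # decoded action sequence
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    # KL divergence between N(mu, diag(exp(log_var))) and N(0, I)
    kl = np.mean(-0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1))
    return recon + beta * kl, recon, kl

# Hypothetical sizes: 50 time steps, 23 joints, horizon 8, latent dimension 16.
traj = rng.standard_normal((50, 23))
windows = sliding_windows(traj, horizon=8)
x = windows.reshape(len(windows), -1)
W_enc = 0.01 * rng.standard_normal((x.shape[1], 2 * 16))
W_dec = 0.01 * rng.standard_normal((16, x.shape[1]))
loss, recon, kl = vae_loss(x, W_enc, W_dec)
```

In a pipeline of this shape, the behavior cloning policy would be trained to predict the latent code `z` from observations, and the decoder would turn each predicted code into a short action sequence.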
Abstract
In imitation learning for dexterous robotic hands, the high complexity of the systems makes it challenging to learn complex manipulation tasks. However, the many datasets depicting human hands performing a variety of tasks offer a rich source of knowledge about human hand motion. We propose a method that leverages multiple large-scale, task-agnostic datasets to obtain latent representations that effectively encode motion sub-trajectories, which we incorporate into a transformer-based behavior cloning method. Our results demonstrate that employing latent representations yields better performance than conventional behavior cloning methods, particularly in resilience to errors and noise in perception and proprioception. Furthermore, the proposed approach relies solely on human demonstrations, eliminating the need for teleoperation and therefore accelerating data acquisition. Accurate inverse kinematics for fingertip retargeting ensures precise transfer from human hand data to the robot, facilitating effective learning and deployment of manipulation policies. Finally, the trained policies have been successfully transferred to a real-world 23-DOF robotic system.
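As a rough illustration of what fingertip retargeting by inverse kinematics involves, the sketch below solves for the joint angles of a single planar two-joint finger so that its fingertip reaches a target position, using damped least-squares iterations. This is a toy model under explicit assumptions: the abstract does not specify the IK solver, the robot has 23 degrees of freedom rather than two, and the link lengths, step size, damping value, and function names here are hypothetical.

```python
import numpy as np

LINKS = np.array([0.05, 0.03])  # hypothetical link lengths in metres

def fingertip(q):
    """Forward kinematics of a planar two-joint finger: joint angles -> tip (x, y)."""
    a1, a12 = q[0], q[0] + q[1]
    return np.array([
        LINKS[0] * np.cos(a1) + LINKS[1] * np.cos(a12),
        LINKS[0] * np.sin(a1) + LINKS[1] * np.sin(a12),
    ])

def jacobian(q):
    """Analytic Jacobian of the fingertip position with respect to the joint angles."""
    a1, a12 = q[0], q[0] + q[1]
    return np.array([
        [-LINKS[0] * np.sin(a1) - LINKS[1] * np.sin(a12), -LINKS[1] * np.sin(a12)],
        [ LINKS[0] * np.cos(a1) + LINKS[1] * np.cos(a12),  LINKS[1] * np.cos(a12)],
    ])

def retarget_fingertip(target, q0, iters=200, step=0.5, damping=1e-6):
    """Damped least-squares IK: move the joints until the tip matches the target."""
    q = np.array(q0, dtype=float)
    for _ in range(iters):
        err = target - fingertip(q)
        J = jacobian(q)
        # Damped pseudo-inverse step: J^T (J J^T + damping * I)^{-1} err
        dq = J.T @ np.linalg.solve(J @ J.T + damping * np.eye(2), err)
        q += step * dq
    return q

# A reachable target, standing in for a human fingertip position mapped into the robot frame.
target = fingertip(np.array([0.6, 0.9]))
q_robot = retarget_fingertip(target, q0=[0.3, 0.5])
```

In a full retargeting setup, one such position objective per fingertip would be solved jointly over all of the hand's joints, typically with joint limits enforced; those details are outside what the abstract states.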
