Leveraging Pretrained Latent Representations for Few-Shot Imitation Learning on a Dexterous Robotic Hand

Davide Liconti, Yasunori Toshimitsu, Robert Katzschmann

TL;DR

We address the challenge of dexterous hand imitation learning by leveraging latent motion representations learned from large-scale, task-agnostic human-hand datasets. The method first trains a reconstruction-based VAE to encode short-horizon hand sub-trajectories into a latent space. It then learns a transformer-based behavior cloning policy that predicts latent codes from observations; these codes are decoded into action sequences. Fingertip IK retargeting maps human hand data to a 23-DOF robot, so demonstrations can be collected without teleoperation. The approach reduces the data burden, improves robustness to perceptual and proprioceptive noise, and transfers to a real-world 23-DOF dexterous robotic system on tasks such as grasping and manipulation. Results in simulation and on hardware show faster convergence and greater resilience to noise, highlighting the value of latent-space representations for robust, data-efficient imitation learning in manipulation.
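To make the "reconstruction-based VAE" idea concrete, the following is a minimal sketch of a textbook VAE objective for one flattened sub-trajectory: a reconstruction error plus a KL term that pulls the latent code towards a standard normal prior. It is not taken from the paper. The mean-squared-error reconstruction term, the diagonal Gaussian posterior, the function names, and the `beta` weight are all assumptions made for illustration.

```python
import math


def gaussian_kl(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def mse(target, prediction):
    """Mean squared error between two flattened sub-trajectories of equal length."""
    return sum((t - p) ** 2 for t, p in zip(target, prediction)) / len(target)


def vae_loss(target, reconstruction, mu, logvar, beta=1.0):
    """Reconstruction term plus beta-weighted KL term.

    `beta` is a hypothetical weighting factor, not a value reported in the source.
    """
    return mse(target, reconstruction) + beta * gaussian_kl(mu, logvar)
```

In a setup like this, the KL term is what encourages nearby sub-trajectories to map to nearby latent codes, which would be consistent with the smooth latent trajectories described in the Figure 5 caption.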

Abstract

In the context of imitation learning applied to dexterous robotic hands, the high complexity of the systems makes learning complex manipulation tasks challenging. However, the numerous datasets depicting human hands in a variety of tasks could provide us with better knowledge of human hand motion. We propose a method that leverages multiple large-scale task-agnostic datasets to obtain latent representations that effectively encode motion subtrajectories, which we incorporate into a transformer-based behavior cloning method. Our results demonstrate that employing latent representations yields enhanced performance compared to conventional behavior cloning methods, particularly regarding resilience to errors and noise in perception and proprioception. Furthermore, the proposed approach relies solely on human demonstrations, eliminating the need for teleoperation and therefore accelerating the data acquisition process. Accurate inverse kinematics for fingertip retargeting ensures precise transfer from human hand data to the robot, facilitating effective learning and deployment of manipulation policies. Finally, the trained policies have been successfully transferred to a real-world 23-DOF robotic system.
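As a toy illustration of fingertip retargeting through inverse kinematics, the sketch below solves the analytic IK of a planar finger with two joints: given a target fingertip position, it returns joint angles that place the tip there. This is a deliberately simplified stand-in and not the solver used in the paper, which targets a much higher-dimensional hand. The link lengths and function names are hypothetical.

```python
import math


def forward_kinematics(theta1, theta2, l1, l2):
    """Fingertip (x, y) of a planar two-link finger with joint angles in radians."""
    x = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    y = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x, y


def fingertip_ik(x, y, l1, l2):
    """Analytic IK for the same finger: one of the two mirror-image solutions."""
    dist_sq = x * x + y * y
    cos_t2 = (dist_sq - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    # Clamp to tolerate targets slightly outside the reachable range.
    cos_t2 = max(-1.0, min(1.0, cos_t2))
    theta2 = math.acos(cos_t2)
    theta1 = math.atan2(y, x) - math.atan2(l2 * math.sin(theta2), l1 + l2 * math.cos(theta2))
    return theta1, theta2


# Hypothetical link lengths (metres) and a target tip position, e.g. from a glove.
L1, L2 = 0.04, 0.03
target = (0.05, 0.03)
angles = fingertip_ik(target[0], target[1], L1, L2)
reached = forward_kinematics(angles[0], angles[1], L1, L2)
```

For a reachable target, feeding the returned angles back through `forward_kinematics` recovers the target position up to floating-point error, which is the property a retargeting step depends on.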

Paper Structure (22 sections, 2 equations, 13 figures, 1 algorithm)

Figures (13)

  • Figure 1: Compared with traditional behavior cloning approaches, this work proposes a few-shot end-to-end pipeline that uses pre-trained latent representations learned from multiple large-scale task-agnostic datasets. This latent space effectively encodes robot actions, giving benefits regarding robustness and stability of the learned policies. Moreover, manipulation demonstrations are acquired without teleoperation, and then deployed on a real dexterous robotic system.
  • Figure 2: Our imitation learning method leverages latent representations of valid human hand motion, pretrained on large-scale datasets. Step 1: We train a reconstruction-based VAE to learn a latent space representation of hand motion by leveraging multiple large task-agnostic datasets, reducing the dimension needed to encode hand subtrajectories. Step 2: We collect a small dataset of recorded demonstrations with a data acquisition pipeline that tracks hand and object poses, and train a behavior policy that outputs latent representations of the hand trajectories, which are then decoded with the pretrained decoder.
  • Figure 3: Data collection for the small task-specific demonstrations. A motion capture glove captures the finger motions, and visual markers are used for tracking the wrist and object poses. With this pipeline the data acquisition can be considerably sped up with respect to teleoperation-based methods.
  • Figure 4: Visual results of IK retargeting. The red dots represent the 3D positions of the fingertips coming from the motion capture glove. Thanks to the 16 DOF of the hand, we can mimic very extreme poses, including pure finger abduction.
  • Figure 5: Analysis and reconstruction of an input demonstration. The input is retargeted into the robot state space, projected into the latent space, and finally reconstructed with high fidelity to the original demonstration. a) Retargeted input demonstration, sampled every 7 frames. b) Latent space projection (PCA) of the subtrajectories of the input demonstration, with a sliding window of size 1. The latent space trajectory is smooth and continuous, due to the KL divergence loss term. c) Reconstruction of the input demonstration, obtained by taking all the frames of the first decoded subtrajectory and the last frame of each subsequent decoded subtrajectory. Comparing this visualization with the input demonstration shows a good reconstruction.
  • ...and 8 more figures
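
The reconstruction procedure described in the Figure 5 caption can be sketched as follows: cut the demonstration into overlapping subtrajectories, then rebuild the full sequence from all frames of the first subtrajectory plus the last frame of every later one. The sketch assumes that the "sliding window of size 1" in the caption means a stride of one frame, and the horizon of 4 frames is a hypothetical value. The VAE encode/decode step is omitted, so the windows here stand in for decoded subtrajectories.

```python
def sliding_windows(frames, horizon, stride=1):
    """Split a demonstration into overlapping subtrajectories of `horizon` frames."""
    return [frames[i:i + horizon] for i in range(0, len(frames) - horizon + 1, stride)]


def stitch(decoded_windows):
    """Rebuild one trajectory from stride-1 windows.

    Keeps every frame of the first window, then appends only the last
    (newest) frame of each subsequent window.
    """
    if not decoded_windows:
        return []
    trajectory = list(decoded_windows[0])
    for window in decoded_windows[1:]:
        trajectory.append(window[-1])
    return trajectory


# Toy demonstration: 10 frames of a 2-dimensional robot state.
demo = [[float(t), 0.5 * float(t)] for t in range(10)]
windows = sliding_windows(demo, horizon=4)  # hypothetical horizon
rebuilt = stitch(windows)
```

With a perfect decoder and a stride of one frame, `rebuilt` matches `demo` exactly; with a real decoder, the gap between the two would reflect the reconstruction error that panel c) visualizes.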