Learning Continuous Grasping Function with a Dexterous Hand from Human Demonstrations

Jianglong Ye, Jiashun Wang, Binghao Huang, Yuzhe Qin, Xiaolong Wang

TL;DR

The paper tackles grasping with a dexterous robot hand by learning a continuous-time grasping function from human demonstrations. It introduces the Continuous Grasping Function (CGF), an implicit function trained within a Conditional Variational Autoencoder that maps object geometry, a query time, and a latent code to robot hand joint positions. Because time is a continuous input, a plan can be sampled as densely as needed, and different latent codes yield diverse grasps. Human trajectories are first translated into Allegro Hand demonstrations, and CGF is trained to reconstruct these motions. The trained model then generates trajectories that are tested in simulation, and the successful ones are deployed on real hardware, with improved sim-to-real transfer and generalization to unseen objects. Compared with two-step planning baselines, the results show smoother trajectories, lower planning cost, and higher real-world success rates.

Abstract

We propose to learn to generate grasping motion for manipulation with a dexterous hand using implicit functions. With continuous time inputs, the model can generate a continuous and smooth grasping plan. We name the proposed model Continuous Grasping Function (CGF). CGF is learned via generative modeling with a Conditional Variational Autoencoder using 3D human demonstrations. We first convert large-scale human-object interaction trajectories to robot demonstrations via motion retargeting, and then use these demonstrations to train CGF. During inference, we sample from CGF to generate different grasping plans in the simulator and select the successful ones to transfer to the real robot. By training on diverse human data, our CGF generalizes to manipulating multiple objects. Compared to previous planning algorithms, CGF is more efficient and achieves a significant improvement in success rate when transferred to grasping with the real Allegro Hand. Our project page is available at https://jianglongye.com/cgf .
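To make the idea of continuous time inputs concrete, here is a minimal sketch, not the authors' implementation. It uses an untrained toy network with random weights, and every dimension and name in it (`LATENT_DIM`, `OBJ_DIM`, `HIDDEN_DIM`, `ToyGraspingFunction`) is an assumption made for illustration. The only point it demonstrates is that a function of the form f(z, object feature, t) can be queried at any time resolution without retraining.

```python
import math
import random

LATENT_DIM = 8    # assumed size of the latent code z
OBJ_DIM = 16      # assumed size of the object feature
HIDDEN_DIM = 32   # assumed hidden width
NUM_JOINTS = 16   # assumed output size (the Allegro Hand has 16 joints)


def init_layer(rng, n_in, n_out):
    """Random weights and zero biases for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(layer, x):
    """Apply one fully connected layer to the vector x."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


class ToyGraspingFunction:
    """Untrained stand-in for f(z, object_feature, t) -> joint positions."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.l1 = init_layer(rng, LATENT_DIM + OBJ_DIM + 1, HIDDEN_DIM)
        self.l2 = init_layer(rng, HIDDEN_DIM, NUM_JOINTS)

    def __call__(self, z, obj_feat, t):
        # Concatenate latent code, object feature, and the scalar query time.
        x = list(z) + list(obj_feat) + [t]
        hidden = [math.tanh(v) for v in linear(self.l1, x)]
        return linear(self.l2, hidden)

    def trajectory(self, z, obj_feat, num_steps):
        """Query the function at num_steps evenly spaced times in [0, 1]."""
        if num_steps < 2:
            raise ValueError("num_steps must be at least 2")
        times = [i / (num_steps - 1) for i in range(num_steps)]
        return [self(z, obj_feat, t) for t in times]


rng = random.Random(1)
z = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
obj_feat = [rng.gauss(0.0, 1.0) for _ in range(OBJ_DIM)]

f = ToyGraspingFunction()
coarse = f.trajectory(z, obj_feat, 5)    # 5 waypoints
dense = f.trajectory(z, obj_feat, 17)    # 17 waypoints from the same function
```

The coarse and dense plans come from the same function and the same latent code, so they agree wherever their query times coincide (for example, `coarse[1]` and `dense[4]` are both evaluated at t = 0.25). In a trained model the object feature would come from an encoder over the object point cloud rather than from random numbers.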

Paper Structure

This paper contains 14 sections, 4 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Examples of our generated trajectories learned from human demonstrations. Given hand-object trajectories from human video (a), we first translate them into robot manipulation demonstrations (b). We then train Continuous Grasping Function (CGF) to generate human-like trajectories and deploy them in simulation (c) and real robot (d).
  • Figure 2: Pipeline overview. During training, human demonstrations are first translated to robot joint positions which serve as the supervision for grasping function learning. During inference, our trained CGF takes a sampled latent code $z$, object feature and query time sequence as inputs to generate the trajectory. We then execute these trajectories in the simulator and deploy successful ones to the real robot.
  • Figure 3: Network architecture. Our generative model takes an object point cloud and a sequence of joint positions as input and recovers the corresponding robot hand poses. The proposed CGF takes the latent code $z$, object feature, and the query time $t$ as inputs to predict the corresponding joint position $\hat{q}_t$. $\oplus$ denotes concatenation.
  • Figure 4: Qualitative evaluation in simulation. Because it learns from human demonstrations, our CGF generates more natural and reasonable trajectories, which helps sim-to-real transfer. G is short for GraspTTA.
  • Figure 5: Grasping interpolation. We show the first frame and the last frame of the grasping trajectory. Yellow lines indicate the trajectory of the palm joint. Our method produces diverse grasps, and the interpolation between them is also plausible. To the best of our knowledge, this result on interpolating both robot hand grasping pose and trajectory has not been shown before.
  • ...and 1 more figure
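The inference procedure in the Figure 2 caption (sample a latent code, generate a trajectory, execute it in the simulator, keep the successful ones) and the interpolation in the Figure 5 caption can be sketched as follows. This is a hedged illustration rather than the paper's code. The helpers `toy_generate` and `toy_simulate` are hypothetical stand-ins for the trained CGF and the physics simulator, and linear interpolation in latent space is an assumption about how the interpolation might be done.

```python
import random


def sample_latent(rng, dim):
    """Draw a latent code from a standard normal prior."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def interpolate_latents(z_a, z_b, alpha):
    """Linear blend of two latent codes (assumed interpolation scheme)."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]


def select_successful(generate, simulate, latents):
    """Generate one trajectory per latent code and keep those that succeed."""
    kept = []
    for z in latents:
        trajectory = generate(z)
        if simulate(trajectory):
            kept.append((z, trajectory))
    return kept


def toy_generate(z):
    """Hypothetical stand-in for CGF: scale z over three query times."""
    return [[t * v for v in z] for t in (0.0, 0.5, 1.0)]


def toy_simulate(trajectory):
    """Hypothetical stand-in for the simulator's success check."""
    return sum(trajectory[-1]) > 0.0


rng = random.Random(0)
candidates = [sample_latent(rng, 4) for _ in range(10)]
successful = select_successful(toy_generate, toy_simulate, candidates)

# Blend the first two candidate codes and generate the in-between trajectory.
z_mid = interpolate_latents(candidates[0], candidates[1], 0.5)
trajectory_mid = toy_generate(z_mid)
```

In practice `generate` would call the trained model with an object feature and a query time sequence, and `simulate` would run the trajectory in a physics simulator and report whether the grasp succeeded. Only the trajectories returned by `select_successful` would then be sent to the real robot.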