
G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis

Yufei Ye, Abhinav Gupta, Kris Kitani, Shubham Tulsiani

TL;DR

G-HOP introduces a 3D diffusion-based prior that jointly models hand and object interactions by representing HOI as a unified interaction grid comprising a latent object SDF and a skeletal distance field for the hand. Trained on seven diverse HOI datasets spanning 155 object categories, the model enables both 3D generation and prior-guided inference, including video-based reconstruction of interaction clips and synthesis of plausible human grasps, through score distillation-based guidance. The approach achieves state-of-the-art or competitive results against task-specific baselines and provides faster convergence for reconstruction by leveraging a strong 3D prior. This work advances the scalability of hand-object interaction understanding and offers a versatile prior for downstream robotic and perception tasks.

Abstract

We propose G-HOP, a denoising-diffusion-based generative prior for hand-object interactions that models both the 3D object and a human hand, conditioned on the object category. To learn a 3D spatial diffusion model that can capture this joint distribution, we represent the human hand via a skeletal distance field to obtain a representation aligned with the (latent) signed distance field for the object. We show that this hand-object prior can then serve as generic guidance to facilitate other tasks such as reconstruction from interaction clips and human grasp synthesis. We believe that our model, trained by aggregating seven diverse real-world interaction datasets spanning 155 categories, represents a first approach that allows jointly generating both hand and object. Our empirical evaluations demonstrate the benefit of this joint prior in video-based reconstruction and human grasp synthesis, outperforming current task-specific baselines. Project website: https://judyye.github.io/ghop-www
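
The hand representation described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the grid resolution, the number of joints (21), the use of plain Euclidean distance with one channel per joint, and the 3-channel stand-in for the latent object SDF are all assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np


def make_grid(resolution, half_extent=1.0):
    """Return an (R, R, R, 3) array of 3D sample points spanning a cube."""
    axis = np.linspace(-half_extent, half_extent, resolution)
    x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
    return np.stack([x, y, z], axis=-1)


def skeletal_distance_field(joints, grid):
    """Distance from every grid point to every joint.

    joints: (J, 3) joint positions; grid: (R, R, R, 3) sample points.
    Returns a (J, R, R, R) array with one channel per joint (assumed layout).
    """
    diff = grid[None, ...] - joints[:, None, None, None, :]
    return np.linalg.norm(diff, axis=-1)


def interaction_grid(object_latent, hand_field):
    """Concatenate object and hand channels on the shared spatial grid."""
    return np.concatenate([object_latent, hand_field], axis=0)


rng = np.random.default_rng(0)
grid = make_grid(resolution=16)
joints = rng.uniform(-0.5, 0.5, size=(21, 3))      # hypothetical joint positions
object_latent = rng.normal(size=(3, 16, 16, 16))   # stand-in for the latent object SDF
hand_field = skeletal_distance_field(joints, grid)
hoi = interaction_grid(object_latent, hand_field)
print(hoi.shape)
```

Because both fields live on the same spatial grid, they can be stacked channel-wise and handled by a single 3D diffusion model, which is the alignment the abstract refers to.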

Paper Structure

This paper contains 48 sections, 4 equations, 18 figures, and 10 tables.

Figures (18)

  • Figure 1: G-HOP can generate plausible hand-object interactions across a wide variety of objects (top). The learned generative prior can also guide inference for tasks such as reconstructing everyday interaction clips and synthesizing human grasps given object meshes.
  • Figure 2: Method Overview of the Generative Hand-Object Prior: Hand-object interactions are represented as interaction grids within the diffusion model. The interaction grid concatenates the (latent) signed distance field for the object and the skeletal distance field for the hand. Given a noisy interaction grid and a text prompt, our diffusion model predicts a denoised grid. To extract the 3D shape of the HOI from the interaction grid, we use a decoder to decode the object latent code and run gradient descent on the hand field to extract hand pose parameters.
  • Figure 2: Comparison with Baselines: We compare our synthesized human grasps against GraspTTA (jiang2021hand) and the annotated grasps provided by the datasets (GT) on HO3D and 3DW. We report the intersection between meshes, the displacement distance in simulation, and the hand contact ratio and area (top). We also report preference percentages from users for pairwise method comparisons on HO3D and 3DW (bottom).
  • Figure 3: Reconstructing Interaction Clips: We parameterize the HOI scene as an object implicit field, a hand pose, and their relative transformation (left). The scene parameters are optimized with respect to an SDS loss on the extracted interaction grid and a reprojection loss (right).
  • Figure 4: Grasp Synthesis: We parameterize human grasps via hand articulation parameters and the relative hand-object transformation (left). These are optimized with respect to the SDS loss by converting the grasp (and the known object shape) to an interaction grid (right).
  • ...and 13 more figures
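
The captions for Figures 3 and 4 describe optimizing parameters against an SDS loss computed on the interaction grid. The sketch below shows the general shape of a score-distillation update under strong simplifying assumptions: it updates the grid values directly rather than the scene or grasp parameters, omits the reprojection loss, uses a fixed noise level, and replaces the trained diffusion model with a toy denoiser that always returns one fixed grid. The names `sds_gradient` and `toy_denoiser` are hypothetical.

```python
import numpy as np


def sds_gradient(grid, denoiser, alpha_bar_t, rng, weight=1.0):
    """One score-distillation gradient estimate with respect to `grid`.

    `denoiser(noisy, alpha_bar_t)` is assumed to return a denoised grid,
    which is converted to a noise prediction before taking the difference.
    """
    noise = rng.normal(size=grid.shape)
    # Forward diffusion: blend the current grid with Gaussian noise.
    noisy = np.sqrt(alpha_bar_t) * grid + np.sqrt(1.0 - alpha_bar_t) * noise
    denoised = denoiser(noisy, alpha_bar_t)
    # Noise implied by the denoised prediction.
    noise_pred = (noisy - np.sqrt(alpha_bar_t) * denoised) / np.sqrt(1.0 - alpha_bar_t)
    # SDS uses the noise residual as the gradient, skipping the denoiser Jacobian.
    return weight * (noise_pred - noise)


# Toy prior: a "denoiser" that always predicts the same target grid.
target = np.full((2, 4, 4, 4), 0.3)


def toy_denoiser(noisy, alpha_bar_t):
    return target


rng = np.random.default_rng(0)
x = np.zeros_like(target)
initial_error = np.abs(x - target).max()
for _ in range(50):
    x = x - 0.1 * sds_gradient(x, toy_denoiser, alpha_bar_t=0.5, rng=rng)
final_error = np.abs(x - target).max()
print(initial_error, final_error)
```

With this toy prior the sampled noise cancels and each step moves the grid a fixed fraction of the way toward the target, so the error shrinks monotonically. With a learned denoiser the update is stochastic, and in the setting the captions describe the gradient would be propagated further to the hand pose, object field, and relative transformation.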