G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis
Yufei Ye, Abhinav Gupta, Kris Kitani, Shubham Tulsiani
TL;DR
G-HOP introduces a 3D diffusion-based prior that jointly models the hand and the object in hand-object interaction (HOI), representing each interaction as a unified grid that combines a latent object SDF with a skeletal distance field for the hand. Trained on seven diverse HOI datasets spanning 155 object categories, the model supports both 3D generation and prior-guided inference through score-distillation guidance, including reconstruction from interaction video clips and synthesis of plausible human grasps. The approach achieves state-of-the-art or competitive results against task-specific baselines, and the strong 3D prior speeds up convergence in reconstruction. This work makes hand-object interaction understanding more scalable and offers a versatile prior for downstream robotics and perception tasks.
Abstract
We propose G-HOP, a denoising-diffusion-based generative prior for hand-object interactions that models both the 3D object and a human hand, conditioned on the object category. To learn a 3D spatial diffusion model that can capture this joint distribution, we represent the human hand via a skeletal distance field, obtaining a representation aligned with the (latent) signed distance field for the object. We show that this hand-object prior can then serve as generic guidance to facilitate other tasks, such as reconstruction from interaction clips and human grasp synthesis. We believe that our model, trained by aggregating seven diverse real-world interaction datasets spanning 155 categories, represents a first approach that allows joint generation of both the hand and the object. Our empirical evaluations demonstrate the benefit of this joint prior in video-based reconstruction and human grasp synthesis, outperforming current task-specific baselines. Project website: https://judyye.github.io/ghop-www
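To make the idea of a skeletal distance field concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the field stores, for every voxel, one channel per hand joint holding the Euclidean distance from the voxel centre to that joint; the abstract does not spell out this definition. The function name `skeletal_distance_field`, the grid resolution, the grid extent, and the three-joint toy "hand" are all hypothetical choices made for illustration.

```python
import math


def skeletal_distance_field(joints, resolution=4, half_extent=1.0):
    """Return per-joint distance grids (an assumed form of a skeletal distance field).

    field[j][x][y][z] is the Euclidean distance from the centre of
    voxel (x, y, z) to joint j.
    """
    # Voxel centres, evenly spaced in [-half_extent, half_extent] on each axis.
    step = 2.0 * half_extent / resolution
    coords = [-half_extent + (i + 0.5) * step for i in range(resolution)]

    field = []
    for joint in joints:
        # One channel per joint: distance from every voxel centre to this joint.
        channel = [
            [[math.dist((x, y, z), joint) for z in coords] for y in coords]
            for x in coords
        ]
        field.append(channel)
    return field


# Hypothetical toy "hand" with three joints (a real hand skeleton has more).
joints = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.75, 0.25, 0.0)]
field = skeletal_distance_field(joints, resolution=4)

# Channels x grid resolution along each axis.
print(len(field), len(field[0]), len(field[0][0]), len(field[0][0][0]))
```

Because every channel lives on the same voxel grid, a field of this kind can be stacked channel-wise with a grid-based object representation, which is one plausible reading of "aligned" in the abstract.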
