A Versatile and Differentiable Hand-Object Interaction Representation
Théo Morales, Omid Taheri, Gerard Lacey
TL;DR
This work tackles accurate hand-object interaction (HOI) estimation and synthesis for AR/MR and robotics. It introduces CHOIR, a versatile, differentiable HOI representation that encodes object geometry with Basis Point Set distances, hand pose via 32 MANO anchors, and dense contacts as 3D Gaussian distributions. A diffusion model, JointDiffusion, is trained to learn $p(d_H, c_H \mid y)$ under multiple contexts $y$, so that it can both refine noisy grasps and synthesize new ones. Compared with the state of the art (SOTA), it increases contact F1 by $5\%$ for refinement and reduces simulation displacement by $46\%$ for synthesis, yielding better contact accuracy and physical realism. The approach is scalable and GPU-friendly, and the work highlights directions for extending it to richer geometry representations and temporal dynamics.
Abstract
Synthesizing accurate hand-object interactions (HOI) is critical for applications in Computer Vision, Augmented Reality (AR), and Mixed Reality (MR). Despite recent advances, the accuracy of reconstructed or generated HOI leaves room for refinement. Some techniques have improved the accuracy of dense correspondences by shifting focus from generating explicit contacts to using rich HOI fields. Still, they lack full differentiability or continuity and are tailored to specific tasks. In contrast, we present a Coarse Hand-Object Interaction Representation (CHOIR), a novel, versatile, and fully differentiable field for HOI modelling. CHOIR leverages discrete unsigned distances for continuous shape and pose encoding, alongside multivariate Gaussian distributions to represent dense contact maps with few parameters. To demonstrate the versatility of CHOIR, we design JointDiffusion, a diffusion model that learns a grasp distribution conditioned on noisy hand-object interactions or on object geometries alone, for both refinement and synthesis applications. We demonstrate JointDiffusion's improvements over the SOTA in both applications: it increases the contact F1 score by $5\%$ for refinement and decreases the simulation displacement by $46\%$ for synthesis. Our experiments show that JointDiffusion with CHOIR yields superior contact accuracy and physical realism compared to SOTA methods designed for specific tasks. Project page: https://theomorales.com/CHOIR
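To make the two ingredients named above more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It shows (i) unsigned distances from a fixed set of basis points to an object point cloud, in the spirit of a Basis Point Set encoding, and (ii) a contact region summarised by the mean and covariance of a 3D Gaussian. The function names `bps_distances` and `fit_contact_gaussian`, the basis size of 64, and the random point clouds are all hypothetical choices made for illustration. How CHOIR associates basis points with hand anchors is not described here and is left out.

```python
import numpy as np


def bps_distances(basis, points):
    """Unsigned distance from each basis point to its nearest point in `points`.

    basis:  (B, 3) array of fixed basis points (hypothetical sampling).
    points: (N, 3) array, e.g. an object point cloud.
    Returns a (B,) array of non-negative distances.
    """
    diff = basis[:, None, :] - points[None, :, :]  # (B, N, 3) pairwise offsets
    dists = np.linalg.norm(diff, axis=-1)          # (B, N) Euclidean distances
    return dists.min(axis=1)                       # nearest neighbour per basis point


def fit_contact_gaussian(contact_points):
    """Summarise a set of 3D contact points by a Gaussian (mean, covariance).

    Assumption: contacts are given as an (M, 3) array with M >= 2.
    """
    mean = contact_points.mean(axis=0)
    centered = contact_points - mean
    cov = centered.T @ centered / (len(contact_points) - 1)  # unbiased estimate
    return mean, cov


# Toy data standing in for a real object and real contacts (illustrative only).
rng = np.random.default_rng(0)
basis = rng.uniform(-1.0, 1.0, size=(64, 3))         # hypothetical basis size
object_points = rng.normal(scale=0.3, size=(500, 3))
contact_points = rng.normal(scale=0.02, size=(40, 3))

d_obj = bps_distances(basis, object_points)          # fixed-length object code
mu, sigma = fit_contact_gaussian(contact_points)     # 3 + 9 numbers per region
```

Because the basis is fixed, `d_obj` has the same length for any object, which is what makes such a distance vector convenient as network input. The Gaussian replaces a dense per-vertex contact map with a mean and a symmetric 3×3 covariance.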
