A Versatile and Differentiable Hand-Object Interaction Representation

Théo Morales, Omid Taheri, Gerard Lacey

TL;DR

This work tackles accurate hand-object interaction (HOI) estimation and synthesis for AR/MR and robotics. It introduces CHOIR, a versatile, differentiable HOI representation that encodes object geometry with Basis Point Set distances, hand pose via 32 MANO anchors, and dense contacts as 3D Gaussian distributions. A diffusion model, JointDiffusion, is trained to learn $p(\boldsymbol{d}_H, \boldsymbol{c}_H \mid y)$, the distribution of the hand distance field $\boldsymbol{d}_H$ and the contact Gaussian parameters $\boldsymbol{c}_H$ given a context $y$ (a noisy hand-object interaction or the object geometry alone), so that it can both refine noisy grasps and synthesize new ones. Compared to the state of the art (SOTA), it achieves a $5\%$ increase in contact F1 for refinement and a $46\%$ reduction in simulation displacement for synthesis. The approach is scalable and GPU-friendly, and the work highlights directions for extending to richer geometry representations and temporal dynamics.
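To make the Basis Point Set idea concrete, here is a minimal sketch of a generic BPS encoding: a fixed set of basis points is sampled once, and each basis point stores the unsigned distance to its nearest target point. The function names, the basis size of 64, the unit-ball sampling, and the toy coordinates are all assumptions for illustration. How CHOIR pairs basis points with MANO anchors is not described in this summary, so plain nearest-neighbour distances are used for both the object and the hand.

```python
import math
import random


def sample_basis_points(n_points, radius=1.0, seed=0):
    """Sample a fixed basis point set uniformly inside a ball (rejection sampling)."""
    rng = random.Random(seed)  # fixed seed: the basis must be identical for every sample
    basis = []
    while len(basis) < n_points:
        p = tuple(rng.uniform(-radius, radius) for _ in range(3))
        if math.dist(p, (0.0, 0.0, 0.0)) <= radius:
            basis.append(p)
    return basis


def bps_distances(basis, points):
    """For each basis point, return the unsigned distance to its nearest target point."""
    return [min(math.dist(b, p) for p in points) for b in basis]


# Toy data (hypothetical coordinates, not taken from the paper).
basis = sample_basis_points(64)
object_points = [(0.1, 0.0, 0.0), (0.0, 0.2, 0.0), (0.0, 0.0, -0.3)]
hand_anchors = [(0.15, 0.05, 0.0), (0.0, 0.25, 0.05)]

d_object = bps_distances(basis, object_points)  # fixed-length object encoding
d_hand = bps_distances(basis, hand_anchors)     # fixed-length hand encoding
```

Because the basis is fixed, both encodings have the same length regardless of how many points the object or the hand contains, which is what makes such a representation convenient as network input.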

Abstract

Synthesizing accurate hand-object interactions (HOI) is critical for applications in Computer Vision, Augmented Reality (AR), and Mixed Reality (MR). Despite recent advances, the accuracy of reconstructed or generated HOI leaves room for refinement. Some techniques have improved the accuracy of dense correspondences by shifting focus from generating explicit contacts to using rich HOI fields. Still, they lack full differentiability or continuity and are tailored to specific tasks. In contrast, we present a Coarse Hand-Object Interaction Representation (CHOIR), a novel, versatile and fully differentiable field for HOI modelling. CHOIR leverages discrete unsigned distances for continuous shape and pose encoding, alongside multivariate Gaussian distributions to represent dense contact maps with few parameters. To demonstrate the versatility of CHOIR, we design JointDiffusion, a diffusion model that learns a grasp distribution conditioned on noisy hand-object interactions or only object geometries, for both refinement and synthesis applications. We demonstrate JointDiffusion's improvements over the state of the art (SOTA) in both applications: it increases the contact F1 score by $5\%$ for refinement and decreases the simulation displacement by $46\%$ for synthesis. Our experiments show that JointDiffusion with CHOIR yields superior contact accuracy and physical realism compared to SOTA methods designed for specific tasks. Project page: https://theomorales.com/CHOIR
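As an illustration of how a Gaussian can summarize a dense contact map with few parameters, the sketch below fits a single 3D Gaussian to weighted points using the weighted maximum-likelihood mean and covariance. This estimator is an assumption, not necessarily the fitting procedure used in the paper, and `fit_weighted_gaussian` is a hypothetical helper name.

```python
def fit_weighted_gaussian(points, weights):
    """Return the weighted mean (3 values) and covariance (3x3) of 3D points."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted mean along each axis.
    mean = [
        sum(w * p[i] for p, w in zip(points, weights)) / total
        for i in range(3)
    ]
    # Weighted covariance (maximum-likelihood form, normalized by the weight sum).
    cov = [
        [
            sum(w * (p[i] - mean[i]) * (p[j] - mean[j])
                for p, w in zip(points, weights)) / total
            for j in range(3)
        ]
        for i in range(3)
    ]
    return mean, cov


# Toy example: two contact points, the first weighted three times as much.
mean, cov = fit_weighted_gaussian([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], [3.0, 1.0])
```

A mean and a symmetric 3x3 covariance amount to nine free parameters per Gaussian, so a handful of such distributions can stand in for a per-vertex contact map.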

Paper Structure

This paper contains 26 sections, 13 equations, 14 figures, 17 tables, 1 algorithm.

Figures (14)

  • Figure 1: Minimal Python code for the stage 1 TTO loss.
  • Figure 2: Illustration of the cone of tolerance used to determine the raw hand contact weights. For each hand vertex, the weights are the count of object points inside the vertex's cone defined along its normal vector. (Left) The green points on the object's surface are inside the cone, hence contributing to the hand vertex's weight while the grey points do not. (Left & Right) No object points are inside the purple cone: its vertex has a contact weight of 0.
  • Figure 3: Visualization of our probabilistic contact maps (best seen in colour). (a) The raw hand contact weights are computed with our cone of tolerance method. (b) 32 3D Gaussian distributions (one for each MANO anchor) are fitted on the weights to obtain contact probability densities. (c) Comparison of the recovered probabilistic dense contact map and of the raw contact weights. Our method leaves gaps in the contact map to allow for a $2\,\mathrm{mm}$ penetration and improve contact fitting.
  • Figure 4: Architecture of JointDiffusion. The 3D U-Net predicts the noise sample $\epsilon_{\boldsymbol{d}}$ for the hand distance field $\boldsymbol{d}_H$. The contact prediction branch predicts the noise sample $\epsilon_{\boldsymbol{c}}$ for the contact Gaussian parameters $\boldsymbol{c}_H$ from the features of the U-Net's bottleneck. This joint learning encourages the U-Net to extract features relevant to both tasks, enhancing the accuracy of the learned CHOIR distribution.
  • Figure 5: Qualitative comparison of grasp denoising on one challenging case of the Perturbed ContactPose benchmark. Our method produces less penetration than TOCH (Zhou et al., 2022), and substantially better output than ContactOpt (Grady et al., 2021), which maximizes hand-object contact.
  • ...and 9 more figures
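
The cone of tolerance described in the Figure 2 caption can be sketched as follows: for each hand vertex, count the object points that fall inside a cone whose apex is the vertex and whose axis is the vertex normal. The half-angle, the cone height, and the function name are assumptions chosen for illustration; the caption does not give the values used by the authors.

```python
import math


def cone_contact_weights(vertices, normals, object_points,
                         half_angle_deg=30.0, max_height=0.01):
    """Count, per hand vertex, the object points inside the cone along its normal.

    half_angle_deg and max_height (metres) are illustrative assumptions.
    """
    cos_limit = math.cos(math.radians(half_angle_deg))
    weights = []
    for v, n in zip(vertices, normals):
        n_len = math.sqrt(sum(c * c for c in n))
        if n_len == 0.0:
            weights.append(0)  # degenerate normal: no cone can be defined
            continue
        axis = [c / n_len for c in n]
        count = 0
        for p in object_points:
            offset = [p[i] - v[i] for i in range(3)]
            dist = math.sqrt(sum(c * c for c in offset))
            if dist == 0.0:
                count += 1  # the point coincides with the cone apex
                continue
            along = sum(offset[i] * axis[i] for i in range(3))
            # Inside if within the cone height and within the angular aperture.
            if along <= max_height and along / dist >= cos_limit:
                count += 1
        weights.append(count)
    return weights


# Toy example: two vertices at the origin with opposite normals.
verts = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
norms = [(0.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
obj = [(0.0, 0.0, 0.005), (0.005, 0.0, 0.001), (0.0, 0.0, 0.05)]
raw_weights = cone_contact_weights(verts, norms, obj)
```

In this toy setup only the first object point lies inside the first vertex's cone. The second point is outside the aperture and the third is beyond the cone height, while the vertex whose normal points away from the object gets a weight of 0, as with the purple cone in the caption.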