Learning Explicit Contact for Implicit Reconstruction of Hand-held Objects from Monocular Images

Junxing Hu, Hongwen Zhang, Zerui Chen, Mengcheng Li, Yunlong Wang, Yebin Liu, Zhenan Sun

TL;DR

This work tackles the ill-posed problem of reconstructing hand-held objects from a single RGB image by introducing explicit hand–object contact as a prior for implicit reconstruction. It first predicts 3D contact probabilities on the hand surface with cascaded part-level and vertex-level graph-based transformers that are learned jointly in a coarse-to-fine manner. It then diffuses these contact states from the hand mesh surface into the surrounding 3D volume using sparse convolutions, and the diffused contact conditions an implicit SDF-based object decoder. The approach yields more realistic object meshes, especially in regions in contact with the hand, and achieves state-of-the-art results on HO3D and OakInk, with ablations validating its components. Because it needs only monocular input and combines explicit contact priors with neural implicit representations, the framework is suited to in-the-wild scenarios.

Abstract

Reconstructing hand-held objects from monocular RGB images is an appealing yet challenging task. In this task, contacts between hands and objects provide important cues for recovering the 3D geometry of the hand-held objects. Though recent works have employed implicit functions to achieve impressive progress, they do not model contacts in their frameworks, which results in less realistic object meshes. In this work, we explore how to model contacts explicitly to benefit the implicit reconstruction of hand-held objects. Our method consists of two components: explicit contact prediction and implicit shape reconstruction. In the first part, we propose a new subtask of directly estimating 3D hand-object contacts from a single image. Part-level and vertex-level graph-based transformers are cascaded and jointly learned in a coarse-to-fine manner for more accurate contact probabilities. In the second part, we introduce a novel method to diffuse estimated contact states from the hand mesh surface to nearby 3D space and leverage the diffused contact probabilities to construct the implicit neural representation for the manipulated object. By estimating the interaction patterns between the hand and the object, our method can reconstruct more realistic object meshes, especially for object parts that are in contact with hands. Extensive experiments on challenging benchmarks show that the proposed method outperforms the current state of the art by a large margin. Our code is publicly available at https://junxinghu.github.io/projects/hoi.html.
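
As a rough illustration of the coarse-to-fine idea in the first part, the sketch below broadcasts a part-level contact estimate to the vertices of each part and lets a vertex-level term refine it. This is not the authors' implementation: the graph-based transformers are replaced by precomputed logits, and the function name, the additive residual combination, and the `vertex_to_part` lookup are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    """Map logits to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def coarse_to_fine_contact(part_logits, vertex_logits, vertex_to_part):
    """Combine part-level and vertex-level contact predictions (hypothetical).

    part_logits:    (P,) coarse contact logits, one per hand part.
    vertex_logits:  (V,) fine residual logits, one per hand vertex.
    vertex_to_part: (V,) integer index of the part each vertex belongs to.
    Returns a (V,) array of per-vertex contact probabilities.
    """
    # Coarse stage: every vertex inherits the logit of its hand part.
    coarse = part_logits[vertex_to_part]
    # Fine stage: a per-vertex residual sharpens the coarse estimate.
    # The additive form is an assumption, not taken from the paper.
    return sigmoid(coarse + vertex_logits)


# Toy example: two parts, four vertices.
part_logits = np.array([2.0, -2.0])        # part 0 likely in contact, part 1 not
vertex_to_part = np.array([0, 0, 1, 1])
vertex_logits = np.array([0.0, -4.0, 0.0, 4.0])
probs = coarse_to_fine_contact(part_logits, vertex_logits, vertex_to_part)
```

In the toy example, the second vertex sits on a part predicted to be in contact but is pushed below 0.5 by its own residual, which is the kind of correction a vertex-level stage can make on top of a part-level one.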

Paper Structure (48 sections, 4 equations, 12 figures, 12 tables)

Figures (12)

  • Figure 1: Given an RGB image, the proposed method predicts hand-object contacts and recovers the 3D geometry of the object. The insight is that the contacts could provide effective cues for the hand-held object reconstruction.
  • Figure 2: The overview of learning explicit contact for implicit reconstruction. First, the method estimates hand contact regions given a monocular RGB image. Based on the template hand mesh, part- and vertex-level graph-based transformers are cascaded for accurate predictions. Second, the estimated contact is used to construct the implicit neural representation. An off-the-shelf module is utilized to produce the camera parameters, hand mesh, and initial features. Then, the structured contact codes are generated by anchoring contact probabilities to the hand mesh surface. After sparse convolutions, the contact states on the hand surface are diffused to its nearby 3D space, which facilitates the perception and reconstruction of the manipulated object.
  • Figure 3: Visualization of contact frequency for different hand regions on OakInk (Yang et al., 2022). (a) Part-level contact. (b) Vertex-level contact.
  • Figure 4: Visualizations of contact prediction on OakInk (Rows 1, 3) and HO3D (Rows 2, 4) datasets. Since the method only estimates contact, the result is rendered on the ground truth hand mesh. For samples whose contact regions are occluded by hands, hand meshes are rotated 180 degrees for clear visualization. The proposed method is robust to both hand and object occlusions.
  • Figure 5: Qualitative comparison with the state-of-the-art method on the HO3D and OakInk datasets. Our method can reconstruct more realistic objects, especially for parts that are in contact with hands.
  • ...and 7 more figures
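
The contact diffusion step described in the Figure 2 caption can be pictured with a small voxel example. The sketch below is a simplified stand-in, not the paper's method: it uses a dense NumPy grid and a fixed 3×3×3 averaging kernel in place of learned sparse convolutions, and the function names, grid size, and bounds are hypothetical.

```python
import numpy as np


def anchor_contact_to_grid(vertices, contact_prob, grid_size=16, bounds=(-1.0, 1.0)):
    """Write per-vertex contact probabilities into a voxel grid.

    vertices:     (V, 3) hand vertex coordinates inside `bounds`.
    contact_prob: (V,) contact probability per vertex.
    """
    lo, hi = bounds
    # Map continuous coordinates to integer voxel indices.
    idx = np.floor((vertices - lo) / (hi - lo) * grid_size).astype(int)
    idx = np.clip(idx, 0, grid_size - 1)
    grid = np.zeros((grid_size, grid_size, grid_size), dtype=np.float64)
    # When several vertices fall in one voxel, keep the largest probability.
    np.maximum.at(grid, (idx[:, 0], idx[:, 1], idx[:, 2]), contact_prob)
    return grid


def diffuse_contact(grid, steps=2):
    """Spread contact values to neighbouring voxels by repeated box averaging.

    Each step is a 3x3x3 mean filter with zero padding; a learned sparse
    convolution would take its place in a real network (assumption).
    """
    out = grid.copy()
    sx, sy, sz = out.shape
    for _ in range(steps):
        padded = np.pad(out, 1, mode="constant")
        acc = np.zeros_like(out)
        for dx in range(3):
            for dy in range(3):
                for dz in range(3):
                    acc += padded[dx:dx + sx, dy:dy + sy, dz:dz + sz]
        out = acc / 27.0
    return out


# Toy example: one vertex in contact at the origin of an 8^3 grid.
grid = anchor_contact_to_grid(np.array([[0.0, 0.0, 0.0]]), np.array([1.0]), grid_size=8)
one_step = diffuse_contact(grid, steps=1)
two_steps = diffuse_contact(grid, steps=2)
```

After one step the contact value reaches only the immediate neighbours of the occupied voxel; each further step extends the reach by one voxel, which mirrors how stacked convolutions let surface contact inform query points in nearby 3D space.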