MOHO: Learning Single-view Hand-held Object Reconstruction with Multi-view Occlusion-Aware Supervision

Chenyangguang Zhang, Guanlong Jiao, Yan Di, Gu Wang, Ziqin Huang, Ruida Zhang, Fabian Manhardt, Bowen Fu, Federico Tombari, Xiangyang Ji

TL;DR

MOHO tackles single-view hand-held object reconstruction under severe occlusion by introducing a synthetic-to-real framework. It pretrains on a large occlusion-free synthetic dataset (SOMVideo) to learn to remove hand-induced occlusion in 3D and 2D spaces, and then finetunes on real-world hand-object videos using amodal-mask-weighted supervision, aided by domain-consistent occlusion-aware features. These features include semantic cues from a pre-trained DINO model, hand-articulated geometric embeddings from MANO-based hand poses, and color features, which condition a geometric volume rendering network to recover complete object geometry and texture. Experiments on HO3D and DexYCB show MOHO surpasses both 3D- and 2D-supervised baselines in geometric reconstruction and novel-view synthesis, highlighting the method's potential for practical robotics and AR/VR applications.

Abstract

Previous works on single-view hand-held object reconstruction typically rely on supervision from 3D ground-truth models, which are hard to collect in the real world. In contrast, readily accessible hand-object videos offer a promising source of training data, but they provide only heavily occluded object observations. In this paper, we present a novel synthetic-to-real framework that exploits Multi-view Occlusion-aware supervision from hand-object videos for Hand-held Object reconstruction (MOHO) from a single image, tackling two predominant challenges in this setting: hand-induced occlusion and object self-occlusion. First, in the synthetic pre-training stage, we render a large-scale synthetic dataset, SOMVideo, with hand-object images and multi-view occlusion-free supervision, which is used to address hand-induced occlusion in both 2D and 3D spaces. Second, in the real-world finetuning stage, MOHO leverages amodal-mask-weighted geometric supervision to mitigate the unfaithful guidance caused by hand-occluded supervising views in the real world. Moreover, domain-consistent occlusion-aware features are incorporated into MOHO to resist object self-occlusion when inferring the complete object shape. Extensive experiments on the HO3D and DexYCB datasets demonstrate that 2D-supervised MOHO outperforms 3D-supervised methods by a large margin.
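
The idea behind amodal-mask-weighted supervision can be illustrated with a small sketch. The abstract does not give the loss, so everything below is an assumption made for illustration: the function name `amodal_weighted_mask_loss`, the L1 form, and the `occluded_weight` parameter are hypothetical and are not taken from the paper. The sketch down-weights pixels that an amodal mask estimate marks as object but that the observed (hand-occluded) mask marks as background, so that missing object regions in a supervising view do not push the reconstruction toward an incomplete shape.

```python
import numpy as np


def amodal_weighted_mask_loss(pred_opacity, visible_mask, amodal_prob,
                              occluded_weight=0.0):
    """Weighted L1 between rendered opacity and the observed object mask.

    Hypothetical illustration, not the paper's formula.
    pred_opacity: rendered object opacity per pixel, values in [0, 1]
    visible_mask: observed object mask (1 = visible object pixel)
    amodal_prob:  estimated full-object (amodal) mask, values in [0, 1]
    occluded_weight: weight given to pixels judged to be hand-occluded
    """
    pred_opacity = np.asarray(pred_opacity, dtype=np.float64)
    visible_mask = np.asarray(visible_mask, dtype=np.float64)
    amodal_prob = np.asarray(amodal_prob, dtype=np.float64)

    # Likely hand-occluded: the amodal estimate says "object",
    # but the observed mask says "not object".
    occluded = amodal_prob * (1.0 - visible_mask)

    # Weight is 1 for trusted pixels and falls to occluded_weight
    # as the occlusion evidence approaches 1.
    weights = 1.0 - (1.0 - occluded_weight) * occluded

    per_pixel = np.abs(pred_opacity - visible_mask)
    return float((weights * per_pixel).sum() / max(weights.sum(), 1e-8))


# Toy 2x2 view: the top-right pixel belongs to the object but is hidden by the hand.
pred = [[1.0, 1.0], [0.0, 0.0]]      # rendering covers the full object
visible = [[1.0, 0.0], [0.0, 0.0]]   # observed mask misses the occluded pixel
amodal = [[1.0, 1.0], [0.0, 0.0]]    # amodal estimate recovers it

weighted = amodal_weighted_mask_loss(pred, visible, amodal)
unweighted = amodal_weighted_mask_loss(pred, visible, np.zeros((2, 2)))
```

In this toy case the unweighted loss penalizes the correct rendering at the occluded pixel, while the weighted loss ignores that pixel. A softer setting such as `occluded_weight=0.5` would keep some penalty there instead of discarding it.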

Paper Structure (12 sections, 4 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: As a synthetic-to-real framework, MOHO is pre-trained with rendered occlusion-free supervision on SOMVideo and then finetuned with real-world hand-occluded supervising views. At inference, MOHO generates a photorealistic reconstructed mesh from a single reference view, resisting both hand-induced occlusion and object self-occlusion.
  • Figure 2: Overview of MOHO. Synthetic-to-real framework: We pre-train MOHO on SOMVideo to resist hand-induced occlusion in both 3D (T-1) and 2D (T-2) spaces. The amodal masks recovered in 2D are transferred to the real-world finetuning stage to compensate for the incomplete hand-occluded supervision (T-3). Network: Given a segmented hand-object image as input, the estimated camera pose (I-1) and hand pose (I-2) are initialized by an off-line system [rong2020frankmocap]. Subsequently, MOHO extracts domain-consistent occlusion-aware features, including generic semantic cues (N-1) and hand-articulated geometric embeddings (N-2), as well as color features (N-3), for the volume rendering heads (N-4) to yield the textured mesh reconstruction of the full hand-held object (N-5).
  • Figure 3: Visual illustration of SOMVideo rendered with occlusion-free multi-view supervisions.
  • Figure 4: Visualization of textured meshes reconstructed by several baselines (IHOI [ye2022s], gSDF [chen2023gsdf], SSDNeRF [chen2023single]) and MOHO on HO3D [hampali2020honnotate] (top) and DexYCB [chao2021dexycb] (bottom). The reconstruction results are shown from the camera view and one novel view.