D-SCo: Dual-Stream Conditional Diffusion for Monocular Hand-Held Object Reconstruction

Bowen Fu, Gu Wang, Chenyangguang Zhang, Yan Di, Ziqin Huang, Zhiying Leng, Fabian Manhardt, Xiangyang Ji, Federico Tombari

TL;DR

This work introduces D-SCo, a centroid-fixed dual-stream conditional diffusion model for monocular hand-held object reconstruction. It addresses two predominant challenges: a hand-constrained centroid-fixing paradigm keeps the object centroid from deviating, and a dual-stream denoiser models hand-object interactions semantically and geometrically through a novel unified hand-object semantic embedding.

Abstract

Reconstructing hand-held objects from a single RGB image is a challenging task in computer vision. In contrast to prior works that use deterministic modeling paradigms, we employ a point cloud denoising diffusion model to account for the probabilistic nature of this problem. At its core, we introduce centroid-fixed dual-stream conditional diffusion for monocular hand-held object reconstruction (D-SCo), tackling two predominant challenges. First, to prevent the object centroid from deviating, we use a novel hand-constrained centroid-fixing paradigm, which improves the stability of the diffusion and reverse processes and the precision of feature projection. Second, we introduce a dual-stream denoiser that models hand-object interactions semantically and geometrically with a novel unified hand-object semantic embedding, improving reconstruction of the hand-occluded region of the object. Experiments on the synthetic ObMan dataset and three real-world datasets (HO3D, MOW, and DexYCB) demonstrate that our approach surpasses all other state-of-the-art methods.
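
To make the centroid-fixing idea concrete, the following is a minimal sketch in standard DDPM notation, not the paper's exact formulation. The symbols $\mathbf{x}_t$, $\bar\alpha_t$, $\boldsymbol{\epsilon}$, and $N$ are assumptions introduced here for illustration; only the predicted centroid $\widehat{\mathcal{M}}$ comes from the text. Assume the clean point cloud $\mathbf{x}_0 \in \mathbb{R}^{N \times 3}$ is expressed relative to its centroid, so that $\frac{1}{N}\sum_i \mathbf{x}_0^{(i)} = \mathbf{0}$. The usual forward process is

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_t;\ \sqrt{\bar\alpha_t}\,\mathbf{x}_0,\ (1-\bar\alpha_t)\,\mathbf{I}\big), \qquad \mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon}.$$

If the noise is also centered, $\tilde{\boldsymbol{\epsilon}}^{(i)} = \boldsymbol{\epsilon}^{(i)} - \frac{1}{N}\sum_j \boldsymbol{\epsilon}^{(j)}$, then

$$\frac{1}{N}\sum_i \mathbf{x}_t^{(i)} = \sqrt{\bar\alpha_t}\cdot\mathbf{0} + \sqrt{1-\bar\alpha_t}\cdot\mathbf{0} = \mathbf{0} \quad \text{for all } t,$$

so the noised cloud $\mathbf{x}_t + \widehat{\mathcal{M}}$ keeps its centroid at $\widehat{\mathcal{M}}$ at every step, and only the shape is diffused.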

Paper Structure (13 sections, 11 equations, 4 figures, 4 tables)

This paper contains 13 sections, 11 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Comparison between D-SCo and naive diffusion models for hand-held object reconstruction. Naive diffusion models are conditioned only on image features, without controlling object centroid deviation or modeling the uncertainty induced by hand occlusion. D-SCo, in contrast, keeps the object centroid fixed under the constraint of the hand so that the diffusion model can focus on shape reconstruction, and uses a dual-stream architecture that processes semantic and geometric priors separately, learning a representation suited to each domain. Together, these address the two problems above.
  • Figure 2: Architecture of D-SCo. (I) Given a single-view RGB image, we first predict the hand pose $\phi_H$ and camera view $\phi_C$ with an off-the-shelf network. (II) The object centroid $\widehat{\mathcal{M}}$ is then estimated by our simple yet efficient hand-constrained centroid prediction network. (III) We further introduce a centroid-fixed diffusion network, which always keeps the centroid of the partially denoised point cloud fixed at the predicted centroid $\widehat{\mathcal{M}}$ during the reverse process. (IV) A dual-stream denoiser is proposed to process semantic and geometric hand-object interaction priors individually and then aggregate them as the condition. A unified hand-object semantic embedding is introduced to serve as a strong prior on hand occlusion.
  • Figure 3: Qualitative results on the ObMan [Hasson et al., CVPR 2019] dataset. For each method and ground truth, we show the reconstruction results in the camera view (column 1) and a novel view (column 2).
  • Figure 4: Qualitative results on the HO3D [Hampali et al., CVPR 2020] (top) and MOW [Cao et al., 2021] (middle and bottom) datasets. For each method and ground truth, we show the reconstruction results in the camera view (column 1) and a novel view (column 2). We also show the unoccluded objects for our method and ground truth.
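
Step (III) of Figure 2 can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a generic DDPM reverse update, and the names `recenter`, `toy_denoiser`, and `centroid_fixed_reverse` are hypothetical. The learned dual-stream denoiser is replaced by a placeholder that predicts zero noise. The only point shown is that re-centering after every reverse step keeps the partially denoised cloud at the predicted centroid.

```python
import numpy as np

def recenter(points, centroid):
    """Translate an (N, 3) point cloud so that its mean equals `centroid`."""
    return points - points.mean(axis=0, keepdims=True) + centroid

def toy_denoiser(x_t, t):
    """Hypothetical stand-in for a learned denoiser: always predicts zero noise."""
    return np.zeros_like(x_t)

def centroid_fixed_reverse(num_points, centroid, betas, denoiser, rng):
    """Generic DDPM reverse loop with the centroid re-fixed after each step."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    # Start from Gaussian noise, already placed at the predicted centroid.
    x = recenter(rng.standard_normal((num_points, 3)), centroid)
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = denoiser(x, t)
        # Standard DDPM posterior mean for the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # No noise is added at the final step (t == 0).
        noise = rng.standard_normal(x.shape) if t > 0 else np.zeros_like(x)
        x = mean + np.sqrt(betas[t]) * noise
        # Centroid fixing: undo any drift introduced by this step.
        x = recenter(x, centroid)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    predicted_centroid = np.array([0.1, -0.2, 0.5])  # illustrative value only
    betas = np.linspace(1e-4, 0.02, 50)              # assumed noise schedule
    cloud = centroid_fixed_reverse(128, predicted_centroid, betas, toy_denoiser, rng)
    print(cloud.shape, np.allclose(cloud.mean(axis=0), predicted_centroid))
```

Because the translation is applied after every update, the denoiser never has to correct a global position error and only has to model shape, which is the motivation stated in the Figure 1 caption.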