DiffCAD: Weakly-Supervised Probabilistic CAD Model Retrieval and Alignment from an RGB Image

Daoyi Gao, Dávid Rozenberszki, Stefan Leutenegger, Angela Dai

TL;DR

DiffCAD tackles the problem of CAD model retrieval and alignment from a single RGB image under depth-scale and shape ambiguities. It introduces a cascade of three diffusion models modeling scene scale, object pose via Normalized Object Coordinates, and latent CAD shape retrieval, all trained solely on synthetic data. The method supports multi-hypothesis outputs and demonstrates a 5.9% improvement over fully supervised state-of-the-art on Scan2CAD with 8 hypotheses. This weakly-supervised probabilistic approach enables robust 3D scene understanding from monocular input without real-world CAD annotations.

Abstract

Perceiving 3D structures from RGB images based on CAD model primitives can enable an effective, efficient 3D object-based representation of scenes. However, current approaches rely on supervision from expensive annotations of CAD models associated with real images, and encounter challenges due to the inherent ambiguities in the task -- both in depth-scale ambiguity in monocular perception, as well as inexact matches of CAD database models to real observations. We thus propose DiffCAD, the first weakly-supervised probabilistic approach to CAD retrieval and alignment from an RGB image. We formulate this as a conditional generative task, leveraging diffusion to learn implicit probabilistic models capturing the shape, pose, and scale of CAD objects in an image. This enables multi-hypothesis generation of different plausible CAD reconstructions, requiring only a few hypotheses to characterize ambiguities in depth/scale and inexact shape matches. Our approach is trained only on synthetic data, leveraging monocular depth and mask estimates to enable robust zero-shot adaptation to various real target domains. Despite being trained solely on synthetic data, our multi-hypothesis approach can even surpass the supervised state-of-the-art on the Scan2CAD dataset by 5.9% with 8 hypotheses.

Paper Structure (54 sections, 7 equations, 9 figures, 11 tables)


Figures (9)

  • Figure 1: Method Overview. To facilitate multi-hypothesis reasoning for CAD model retrieval and alignment to a single image, we employ diffusion modeling over scene scale, object pose, and shape. From an input RGB image, we employ machine-generated estimates of depth and instance segmentation. From the estimated depth, we estimate scene scales with $\Phi_s$. $\Phi_n$ uses the back-projected estimated depth of each detected object to output hypotheses for its Normalized Object Coordinates (NOCs). $\Phi_z$ then uses the estimated NOCs to predict the object shape as a latent vector that can be used for retrieval. Our probabilistic modeling also enables robust real-world CAD retrieval and alignment while training only on synthetic data.
  • Figure 2: Multi-hypothesis nature of NOC. The symmetry in object geometry and the incomplete perception can lead to multiple feasible alignments, which we characterize in our probabilistic, diffusion-based approach.
  • Figure 3: Qualitative Comparison on ScanNet images [dai2017scannet, avetisyan2019scan2cad]. Our weakly-supervised probabilistic approach produces more representative retrieval and alignment, even under strong occlusions (bottom), compared with in-domain supervised methods [gumeli2022roca, langer2022sparc].
  • Figure 4: Qualitative Results on ScanNet images. Our probabilistic approach produces multiple feasible sets of object shape and pose pairs given the ambiguities in monocular perception. Bottom left: the two hypotheses corresponding to the smallest and largest scene scale reconstructions reflect the possible depth-scale ambiguity from the camera view.
  • Figure 5: Qualitative Results on ARKit images. Our approach achieves robust retrieval and alignment across various scenes, reconstructing each scene with multiple feasible sets of object shape and pose pairs given the ambiguities in monocular perception. Dotted: the three hypotheses corresponding to different scene scales.
  • ...and 4 more figures
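
The Figure 1 caption mentions back-projecting the estimated depth of each detected object under a scene-scale hypothesis. The sketch below illustrates that step only, using standard pinhole back-projection. It is not the authors' implementation: the function name `back_project`, the intrinsics `fx, fy, cx, cy`, and the hand-picked scale values are assumptions made for illustration. In the paper the scale hypotheses would come from $\Phi_s$.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def back_project(
    depth: Sequence[Sequence[float]],
    mask: Sequence[Sequence[bool]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    scale: float = 1.0,
) -> List[Point]:
    """Lift masked depth pixels to camera-space 3D points (pinhole model).

    `scale` is a hypothetical scene-scale hypothesis applied to the
    (scale-ambiguous) monocular depth estimate before back-projection.
    """
    points: List[Point] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (d, inside) in enumerate(zip(depth_row, mask_row)):
            # Skip pixels outside the object mask or without valid depth.
            if not inside or d <= 0.0:
                continue
            z = scale * d
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth map and object mask (illustrative values only).
depth = [[1.0, 2.0], [0.0, 4.0]]
mask = [[True, True], [True, False]]

# Two hand-picked scale hypotheses yield two candidate point clouds.
hypotheses = {
    s: back_project(depth, mask, fx=2.0, fy=2.0, cx=0.5, cy=0.5, scale=s)
    for s in (1.0, 2.0)
}
```

Each scale hypothesis multiplies every point by the same factor, so all candidate point clouds project to the same pixels. This is the depth-scale ambiguity that the multi-hypothesis formulation is meant to capture.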