DreamUp3D: Object-Centric Generative Models for Single-View 3D Scene Understanding and Real-to-Sim Transfer

Yizhe Wu, Haitz Sáez de Ocáriz Borde, Jack Collins, Oiwi Parker Jones, Ingmar Posner

TL;DR

DreamUp3D tackles real-time, single-view 3D scene understanding with an object-centric generative model that operates on RGB-D input. It combines an IC-SBP-based segmentation stage with per-object shape completion via a shape-GRAF and per-object scene decoders (object- and background-GRAFs) to produce 3D reconstructions and unsupervised 6D pose estimates for each object. The approach is self-supervised and supports real-to-sim transfer. It achieves fast test-time inference and outperforms NeRF-based baselines and prior OCGM methods in reconstruction quality and pose robustness, particularly under occlusion. These capabilities are directly relevant to real-time robotic manipulation and planning, where perception must be robust with minimal multi-view data.

Abstract

3D scene understanding for robotic applications exhibits a unique set of requirements, including real-time inference, object-centric latent representation learning, accurate 6D pose estimation and 3D reconstruction of objects. Current methods for scene understanding typically rely on a combination of trained models paired with either an explicit or learnt volumetric representation, all of which have their own drawbacks and limitations. We introduce DreamUp3D, a novel Object-Centric Generative Model (OCGM) designed explicitly to perform inference on a 3D scene informed only by a single RGB-D image. DreamUp3D is a self-supervised model, trained end-to-end, and is capable of segmenting objects, providing 3D object reconstructions, and generating object-centric latent representations and accurate per-object 6D pose estimates. We compare DreamUp3D to baselines including NeRFs, pre-trained CLIP-features, ObSurf, and ObPose, in a range of tasks including 3D scene reconstruction, object matching and object pose estimation. Our experiments show that our model outperforms all baselines by a significant margin in real-world scenarios, displaying its applicability for 3D scene understanding tasks while meeting the strict demands exhibited in robotics applications.

Paper Structure

This paper contains 15 sections, 10 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: DreamUp3D interprets a single-view RGB-D image in a 3D object-centric manner. It uses the IC-SBP algorithm to cluster the input point cloud into object masks. It then infers the object pose and the object-centric latent representation for each detected object. Each object is reconstructed in 3D and then merged with the other objects and the background component to form the entire 3D scene reconstruction. Predictions can then be leveraged for downstream manipulation tasks.
  • Figure 2: Architectural diagram of DreamUp3D. The model is composed of several distinct modules for data preprocessing, scene segmentation, pose estimation and object encoding. See the method section for details.
  • Figure 3: Example configurations used in the experiments. Panels depict the 7-DoF Franka Panda robot together with YCB objects randomly selected and configured on the tabletop.
  • Figure 4: Given a single view of the scene, DreamUp3D produces full scene reconstructions from arbitrary vantage points. For two examples, (a) and (b), the top two rows show the scene reconstructions for RGB and depth respectively from various viewpoints. The following two rows show RGB and depth reconstructions for individual objects and the background components. Example (a) demonstrates the ability of DreamUp3D to reconstruct the shapes of the cracker box and the chips despite both being partially out of view in the input image.
  • Figure 5: Comparison of point cloud ground truth, NeRF reconstruction using 32 views, and DreamUp3D reconstruction using a single-view image as input. In the case of DreamUp3D we superimpose the ground truth (white) onto the reconstruction (red). All images are taken from the same viewpoint, with the size of the point clouds increased for visual clarity.
  • ...and 2 more figures
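
The Figure 1 caption mentions two geometric ingredients: a point cloud derived from the RGB-D input, and a per-object pose that relates each object to the scene. The sketch below illustrates only that geometry, under stated assumptions, and is not DreamUp3D's implementation: it assumes a standard pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, and it represents a pose as a rotation matrix plus a translation. The function names are illustrative and do not come from the paper.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject(depth: Sequence[Sequence[float]],
                fx: float, fy: float, cx: float, cy: float) -> List[Point]:
    """Pinhole back-projection: pixel (u, v) with depth z -> camera-frame point."""
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:  # skip missing or invalid depth readings
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def yaw_rotation(theta: float) -> List[List[float]]:
    """Rotation about the z-axis only; a full 6D pose has three rotational DoF."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def to_object_frame(points: Sequence[Point],
                    rotation: Sequence[Sequence[float]],
                    translation: Sequence[float]) -> List[Point]:
    """Invert p_cam = R @ p_obj + t, i.e. compute p_obj = R^T @ (p_cam - t)."""
    out: List[Point] = []
    for p in points:
        d = [p[i] - translation[i] for i in range(3)]
        # Row i of R^T is column i of R.
        out.append((
            sum(rotation[k][0] * d[k] for k in range(3)),
            sum(rotation[k][1] * d[k] for k in range(3)),
            sum(rotation[k][2] * d[k] for k in range(3)),
        ))
    return out


# Toy 2x2 depth image (metres); the zero entry stands for a missing reading.
cloud = backproject([[1.0, 0.0], [2.0, 2.0]], fx=1.0, fy=1.0, cx=0.0, cy=0.0)

# Express a camera-frame point in the frame of an object with an assumed pose.
R = yaw_rotation(math.pi / 2)
t = (0.5, 0.2, 1.0)
canonical = to_object_frame([(0.5, 1.2, 1.0)], R, t)
```

Expressing each object's points in its own frame is what lets an object-centric model describe shape independently of where the object sits in the scene; the pose then places the reconstructed object back into the camera frame.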