Coherent 3D Scene Diffusion From a Single RGB Image

Manuel Dahnert, Angela Dai, Norman Müller, Matthias Nießner

TL;DR

This work casts single-view 3D scene reconstruction as a conditional diffusion process that jointly infers the poses and geometries of all objects in the scene. It introduces a diffusion-based scene prior that uses intra-scene attention to model inter-object relationships, and a surface alignment loss that draws point samples from an expressive intermediate shape representation so that training works with partial ground truth. The method reports state-of-the-art results, with a 12.04% improvement in AP3D on SUN RGB-D and a 13.43% increase in F-Score on Pix3D, while also generalizing to unseen indoor data and enabling unconditional shape synthesis. More coherent 3D scene understanding from monocular input has potential applications in robotics, AR/VR content creation, and immersive environments.

Abstract

We present a novel diffusion-based approach for coherent 3D scene reconstruction from a single RGB image. Our method utilizes an image-conditioned 3D scene diffusion model to simultaneously denoise the 3D poses and geometries of all objects within the scene. Motivated by the ill-posed nature of the task and to obtain consistent scene reconstruction results, we learn a generative scene prior by conditioning on all scene objects simultaneously to capture the scene context and by allowing the model to learn inter-object relationships throughout the diffusion process. We further propose an efficient surface alignment loss to facilitate training even in the absence of full ground-truth annotation, which is common in publicly available datasets. This loss leverages an expressive shape representation, which enables direct point sampling from intermediate shape predictions. By framing the task of single RGB image 3D scene reconstruction as a conditional diffusion process, our approach surpasses current state-of-the-art methods, achieving a 12.04% improvement in AP3D on SUN RGB-D and a 13.43% increase in F-Score on Pix3D.
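
The surface alignment loss is only described at a high level here: points are sampled from the intermediate shape prediction, moved to camera space with the predicted object pose, and compared against a ground-truth depth map (see the Figure 2 caption). The following is a minimal sketch of that idea, not the authors' implementation. It assumes a pinhole camera, a pose made of a rotation, a translation and a per-axis scale, and a one-sided Chamfer distance measured from the observed depth points to the predicted surface samples. The distance direction, the pose parameterization, and all function names are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject_depth(depth: Sequence[Sequence[float]],
                      fx: float, fy: float, cx: float, cy: float) -> List[Point]:
    """Lift a depth map to camera-space points (assumed pinhole model).

    Pixels with depth <= 0 are treated as missing and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def apply_pose(points: Sequence[Point],
               rotation: Sequence[Sequence[float]],
               translation: Sequence[float],
               scale: Sequence[float]) -> List[Point]:
    """Map object-space points to camera space: x_cam = R (s * x) + t."""
    transformed = []
    for p in points:
        scaled = [scale[k] * p[k] for k in range(3)]
        transformed.append(tuple(
            sum(rotation[r][k] * scaled[k] for k in range(3)) + translation[r]
            for r in range(3)
        ))
    return transformed


def one_sided_chamfer(src: Sequence[Point], dst: Sequence[Point]) -> float:
    """Mean squared distance from each src point to its nearest dst point."""
    if not src or not dst:
        return 0.0  # nothing to align
    total = 0.0
    for p in src:
        total += min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in dst)
    return total / len(src)


def surface_alignment_loss(shape_samples: Sequence[Point],
                           rotation: Sequence[Sequence[float]],
                           translation: Sequence[float],
                           scale: Sequence[float],
                           depth: Sequence[Sequence[float]],
                           intrinsics: Tuple[float, float, float, float]) -> float:
    """Hypothetical loss: observed depth points should lie on the posed shape."""
    fx, fy, cx, cy = intrinsics
    observed = backproject_depth(depth, fx, fy, cx, cy)
    predicted = apply_pose(shape_samples, rotation, translation, scale)
    return one_sided_chamfer(observed, predicted)


# Toy usage: one depth pixel at 2 m, two surface samples, identity rotation.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
SAMPLES = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
DEPTH = [[2.0]]
INTRINSICS = (1.0, 1.0, 0.0, 0.0)

aligned = surface_alignment_loss(SAMPLES, IDENTITY, (0.0, 0.0, 2.0),
                                 (1.0, 1.0, 1.0), DEPTH, INTRINSICS)
shifted = surface_alignment_loss(SAMPLES, IDENTITY, (0.0, 0.0, 3.0),
                                 (1.0, 1.0, 1.0), DEPTH, INTRINSICS)
```

Measuring the distance from the depth points to the shape samples, rather than the reverse, reflects that a depth map only covers the visible part of an object, so a complete predicted shape is not penalized for surface that the camera cannot see. A training setup would use a differentiable tensor library in place of these Python loops.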

Paper Structure

This paper contains 49 sections, 8 equations, 12 figures, 9 tables.

Figures (12)

  • Figure 1: Given a single RGB image of an indoor scene, our model reconstructs the 3D scene by jointly estimating object arrangements and shapes in a globally consistent manner. Our novel diffusion-based 3D scene reconstruction approach achieves highly accurate predictions by utilizing a novel generative scene prior that captures scene context and inter-object relationships, and by employing an efficient surface alignment loss formulation for joint pose- and shape-synthesis.
  • Figure 2: Scene Prior and Surface Alignment Loss Overview. (Left) We propose a novel way to model scene priors by modeling the scene context and the relationships between all objects during the denoising process. (Right) For additional supervision and joint training, we use a surface alignment loss between a given ground truth depth map and point samples directly drawn from the intermediate shape representation $\hat{\sigma}_i$ and transformed to camera space with the predicted object pose $\hat{\rho}_i$.
  • Figure 3: Qualitative comparison of 3D scene reconstruction on SUN RGB-D [Song et al., 2015]. While the baselines often produce noisy or incomplete shape reconstruction of intersecting or misplaced objects, our method produces plausible object arrangements as well as high-quality shape reconstructions.
  • Figure 4: Inference results on ScanNet [Dai et al., 2017]. We use our model trained on SUN RGB-D [Song et al., 2015] and perform inference on individual frames of ScanNet without fine-tuning. We observe strong generalization capabilities with respect to different camera parameters and scene arrangements.
  • Figure 5: Unconditional results. Injecting $\varnothing$ as a condition to our conditional diffusion model, i.e., effectively disabling the conditioning mechanism, results in high-quality and diverse results.
  • ...and 7 more figures