
Gen3DSR: Generalizable 3D Scene Reconstruction via Divide and Conquer from a Single View

Andreea Ardelean, Mert Özer, Bernhard Egger

TL;DR

Gen3DSR introduces a modular divide-and-conquer pipeline for reconstructing 3D scenes from a single image without end-to-end 3D supervision. The method first analyzes the scene holistically to produce depth, camera, and segmentation information, then reconstructs each object with a diffusion-prior-based single-view method enhanced by amodal completion, followed by assembling the results into a coherent scene and modeling the background. Its key contributions are a compositional framework that can be incrementally improved by swapping modules, a learned amodal completion component, and a robust reprojection/linking strategy that aligns object reconstructions to scene depth. Empirically, Gen3DSR achieves competitive or superior results on synthetic 3D-FRONT and real HOPE-Image data, including challenging real-world scenes, while maintaining zero-shot generalization.

Abstract

Single-view 3D reconstruction is currently approached from two dominant perspectives: reconstruction of scenes with limited diversity using 3D data supervision or reconstruction of diverse singular objects using large image priors. However, real-world scenarios are far more complex and exceed the capabilities of these methods. We therefore propose a hybrid method following a divide-and-conquer strategy. We first process the scene holistically, extracting depth and semantic information, and then leverage an object-level method for the detailed reconstruction of individual components. By splitting the problem into simpler tasks, our system is able to generalize to various types of scenes without retraining or fine-tuning. We purposely design our pipeline to be highly modular with independent, self-contained modules, to avoid the need for end-to-end training of the whole system. This enables the pipeline to naturally improve as future methods can replace the individual modules. We demonstrate the reconstruction performance of our approach on both synthetic and real-world scenes, comparing favorably against prior works. Project page: https://andreeadogaru.github.io/Gen3DSR
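
To make the modularity argument concrete, the divide-and-conquer flow described above can be sketched as three swappable stages. This is only an illustrative sketch, not the authors' code: the names `ScenePipeline`, `SceneAnalysis`, and the `stub_*` functions are hypothetical, and the stubs return strings where real modules would return depth maps, masks, and meshes.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable, List


@dataclass
class SceneAnalysis:
    """Holistic scene information (hypothetical container)."""
    depth: Any                 # stand-in for a monocular depth estimate
    instance_masks: List[Any]  # stand-in for one mask per detected object


@dataclass
class ScenePipeline:
    """Three independent stages; any of them can be replaced on its own."""
    analyze_scene: Callable[[Any], SceneAnalysis]
    reconstruct_object: Callable[[Any, Any], Any]
    compose_scene: Callable[[List[Any], SceneAnalysis], Any]

    def run(self, image: Any) -> Any:
        # 1. Process the scene holistically.
        analysis = self.analyze_scene(image)
        # 2. Reconstruct each identified instance individually.
        objects = [self.reconstruct_object(image, mask)
                   for mask in analysis.instance_masks]
        # 3. Assemble the per-object results into one scene.
        return self.compose_scene(objects, analysis)


# Placeholder modules (assumptions for illustration only).
def stub_analyze(image):
    return SceneAnalysis(depth="depth", instance_masks=["chair", "table"])


def stub_reconstruct(image, mask):
    return f"mesh({mask})"


def stub_better_reconstruct(image, mask):
    return f"better_mesh({mask})"


def stub_compose(meshes, analysis):
    return {"objects": meshes, "depth": analysis.depth}


pipeline = ScenePipeline(stub_analyze, stub_reconstruct, stub_compose)
scene = pipeline.run("image")

# Swapping one module leaves the other two untouched.
upgraded = replace(pipeline, reconstruct_object=stub_better_reconstruct)
upgraded_scene = upgraded.run("image")
```

Because each stage is only a callable with a fixed input and output, replacing the object reconstructor with a newer method requires no change to scene analysis or composition, which is the property the abstract emphasizes.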

Paper Structure (25 sections, 15 figures, 5 tables)

This paper contains 25 sections, 15 figures, 5 tables.

Figures (15)

  • Figure 1: Our method can reconstruct a 3D scene from a single view. We identify distinct objects, address their occlusions through amodal completion, and reconstruct them individually. The resulting 3D objects are composed into the scene using monocular depth guides. Each component is reconstructed as a triangle mesh, enabling downstream applications such as scene manipulation and editing.
  • Figure 2: Method Overview: the input image is first analyzed collectively by an ensemble of state-of-the-art monocular models. Subsequently, the identified instances are individually processed, as elaborated in Figure 3. The reconstructed objects, along with the modeled background, are composed into the final scene, which can then be used in various applications.
  • Figure 3: Detailed overview of the processing steps of each instance. An image is processed through the scene analysis part of our framework as described in Figure 2. Then, we add object recognition information for diffusion-guided completion to restore partially occluded objects. Lastly, we perform reconstruction and align the result back to the input view space.
  • Figure 4: Qualitative results on the 3D-FRONT dataset. The methods considered reconstruct full scenes including background regions and foreground instances.
  • Figure 5: Qualitative results on 3D-FRONT, illustrated under the input view and a second view chosen to highlight the reconstruction performance. In contrast to Ours and InstPIFu, DreamGaussian reconstructs all the objects at once and does not model the background.
  • ...and 10 more figures
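
The captions for Figures 1 and 3 mention composing reconstructed objects into the scene using monocular depth and aligning each result back to the input view. One simple way to picture such an alignment is a scale-and-shift fit between an object's rendered depth and the scene depth at the same pixels. The sketch below uses ordinary least squares purely as an assumption for illustration; it is not the paper's alignment procedure, and the function name `fit_scale_shift` and the sample values are hypothetical.

```python
def fit_scale_shift(object_depth, scene_depth):
    """Least-squares fit of scene_depth ~ scale * object_depth + shift.

    Both inputs are flat lists of depth values sampled at the same pixels.
    Illustrative assumption only, not the method used in the paper.
    """
    n = len(object_depth)
    if n != len(scene_depth) or n < 2:
        raise ValueError("need two equally sized lists with at least 2 samples")

    mean_obj = sum(object_depth) / n
    mean_scene = sum(scene_depth) / n

    # Variance of the object depth and its covariance with the scene depth.
    var_obj = sum((x - mean_obj) ** 2 for x in object_depth)
    if var_obj == 0:
        raise ValueError("object depth is constant; scale is undetermined")
    cov = sum((x - mean_obj) * (y - mean_scene)
              for x, y in zip(object_depth, scene_depth))

    scale = cov / var_obj
    shift = mean_scene - scale * mean_obj
    return scale, shift


# Made-up sample values: the scene depth is exactly 2 * object depth + 0.5.
obj = [1.0, 2.0, 3.0]
scn = [2.5, 4.5, 6.5]
scale, shift = fit_scale_shift(obj, scn)
aligned = [scale * d + shift for d in obj]
```

A plain least-squares fit like this is sensitive to outliers such as pixels covered by an occluder, so a practical system would likely need a more robust estimator.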