A Modular Framework for Single-View 3D Reconstruction of Indoor Environments
Yuxiao Li
TL;DR
The paper tackles single-view indoor scene 3D reconstruction under heavy occlusions by introducing a modular diffusion-based framework that first predicts complete background and occluded foreground views and then lifts them into 3D. It combines an amodal completion module, a defurnishing inpainting model, a depth fusion strategy (Marigold–Dust3r) with test-time fine-tuning, and a view-space alignment pipeline to ensure accurate scene composition. Extensive experiments on 3D-FRONT demonstrate superior visual quality and reconstruction accuracy over state-of-the-art methods, with ablations confirming the value of each component. The approach promises practical impact for interior design, real estate, and cultural heritage through accessible, high-fidelity single-view reconstructions with strong generalization capabilities.
Abstract
We propose a modular framework for single-view 3D reconstruction of indoor scenes, in which several core modules are powered by diffusion techniques. Traditional approaches to this task often struggle with the complex instance shapes and occlusions inherent in indoor environments. They frequently overreach by attempting to predict 3D shapes directly from incomplete 2D images, which limits reconstruction quality. We aim to overcome this limitation by splitting the process into two steps: first, we employ diffusion-based techniques to predict complete views of the room background and of occluded indoor instances; then, we transform these views into 3D. Our modular framework contributes the following components: an amodal completion module that restores the full view of occluded instances, an inpainting model specifically trained to predict room layouts, a hybrid depth estimation technique that balances overall geometric accuracy with fine detail expressiveness, and a view-space alignment method that exploits both 2D and 3D cues to ensure precise placement of instances within the scene. This approach effectively reconstructs both foreground instances and the room background from a single image. Extensive experiments on the 3D-FRONT dataset demonstrate that our method outperforms current state-of-the-art (SOTA) approaches in both visual quality and reconstruction accuracy. The framework holds promising potential for applications in interior design, real estate, and augmented reality.
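To make the idea of a hybrid depth estimate more concrete, the sketch below shows one common way to combine two depth predictions: fit a least-squares scale and shift that maps a detail-rich relative depth map onto a geometrically more reliable reference. This is an illustrative assumption, not the fusion procedure described in the paper; the function names `fit_scale_shift` and `fuse_depth` are hypothetical, and depth maps are flattened to plain lists for simplicity.

```python
from typing import List, Sequence, Tuple


def fit_scale_shift(
    relative: Sequence[float], reference: Sequence[float]
) -> Tuple[float, float]:
    """Return (scale, shift) minimizing sum((scale * r + shift - ref)^2).

    Hypothetical helper: closed-form 1D linear regression of the
    reference depth on the relative depth.
    """
    n = len(relative)
    if n != len(reference) or n < 2:
        raise ValueError("need two equally sized sequences with >= 2 samples")
    mean_x = sum(relative) / n
    mean_y = sum(reference) / n
    var_x = sum((x - mean_x) ** 2 for x in relative)
    if var_x == 0.0:
        raise ValueError("relative depth is constant; scale is undetermined")
    cov_xy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(relative, reference)
    )
    scale = cov_xy / var_x
    shift = mean_y - scale * mean_x
    return scale, shift


def fuse_depth(
    relative: Sequence[float], reference: Sequence[float]
) -> List[float]:
    """Rescale the relative depth so it agrees with the reference globally
    while keeping the relative map's own per-pixel structure."""
    scale, shift = fit_scale_shift(relative, reference)
    return [scale * x + shift for x in relative]


if __name__ == "__main__":
    relative_depth = [0.0, 0.5, 1.0]   # assumed affine-invariant prediction
    reference_depth = [1.0, 2.0, 3.0]  # assumed metric-scale prediction
    print(fuse_depth(relative_depth, reference_depth))
```

In a real pipeline the fit would typically be restricted to pixels where the reference is trusted, and the result could be refined further; those choices are outside what the abstract specifies.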
