A Modular Framework for Single-View 3D Reconstruction of Indoor Environments

Yuxiao Li

TL;DR

The paper tackles single-view indoor scene 3D reconstruction under heavy occlusions by introducing a modular diffusion-based framework that first predicts complete background and occluded foreground views and then lifts them into 3D. It combines an amodal completion module, a defurnishing inpainting model, a depth fusion strategy (Marigold–Dust3r) with test-time fine-tuning, and a view-space alignment pipeline to ensure accurate scene composition. Extensive experiments on 3D-FRONT demonstrate superior visual quality and reconstruction accuracy over state-of-the-art methods, with ablations confirming the value of each component. The approach promises practical impact for interior design, real estate, and cultural heritage through accessible, high-fidelity single-view reconstructions with strong generalization capabilities.

Abstract

We propose a modular framework for single-view indoor scene 3D reconstruction, where several core modules are powered by diffusion techniques. Traditional approaches for this task often struggle with the complex instance shapes and occlusions inherent in indoor environments. They frequently overreach by attempting to predict 3D shapes directly from incomplete 2D images, which results in limited reconstruction quality. We aim to overcome this limitation by splitting the process into two steps: first, we employ diffusion-based techniques to predict the complete views of the room background and occluded indoor instances; then, we transform them into 3D. Our modular framework contributes the following components: an amodal completion module for restoring the full view of occluded instances, an inpainting model specifically trained to predict room layouts, a hybrid depth estimation technique that balances overall geometric accuracy with fine detail expressiveness, and a view-space alignment method that exploits both 2D and 3D cues to ensure precise placement of instances within the scene. This approach effectively reconstructs both foreground instances and the room background from a single image. Extensive experiments on the 3D-FRONT dataset demonstrate that our method outperforms current state-of-the-art (SOTA) approaches in terms of both visual quality and reconstruction accuracy. The framework holds promising potential for applications in interior design, real estate, and augmented reality.

Paper Structure

This paper contains 36 sections, 2 equations, 12 figures, 4 tables, 1 algorithm.

Figures (12)

  • Figure 1: Overview of the proposed pipeline. Our method starts with a single image of the indoor scene and, through comprehensive scene understanding, reconstructs the components in 3D. It then integrates both 2D canonical view and 3D points as supervision to compose a complete final reconstruction.
  • Figure 2: Comparison of segmentation methods. Our method produces high-quality complete masks for each instance and also supplies semantic information.
  • Figure 3: Marigold test-time fine-tuning scheme. This figure is adapted from Figure 3 of the original Marigold paper (ke_repurposing_2024). Compared to the original inference scheme, we predict the denoised depth latent $\mathbf{z}_{0}^{(\text{d})}$ directly from the noise latent $\mathbf{z}_{t}^{(\text{d})}$, and then minimize the difference between the decoded Marigold depth and the Dust3r depth. The gradient is back-propagated to optimize the scale $\lambda$, shift $\mu$, and the noise latent $\mathbf{z}_{t}^{(\text{d})}$, thereby aligning the output Marigold depth with the precise geometric representation provided by the Dust3r depth.
  • Figure 4: Details of the amodal completion module. We initialize the inpainting mask with the instance masks of neighbouring instances. To refine the inpainting mask, we start by removing pixels that are farther from the camera than the instance being processed. Next, we perform a boundary-touching check, extending the mask to include out-of-frame areas where the instance touches the image boundary. Once the complete inpainting mask is obtained, we use SD2 inpainting to restore the full view of the instance.
  • Figure 5: Overview of the Marigold inpainting fine-tuning protocol. This figure is adapted from Figure 2 of the original Marigold paper (ke_repurposing_2024). Compared with the original Marigold, we additionally take the inpainting mask $\mathbf{M}$ as input, and instead of predicting the depth $\mathbf{d}$, we predict the empty room $\mathbf{Y}$.
  • ...and 7 more figures
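
The Figure 3 caption describes aligning the Marigold depth to the Dust3r depth through a scale $\lambda$ and a shift $\mu$. The sketch below illustrates only that scale-and-shift part, and it is a deliberate simplification: it uses a closed-form least-squares fit instead of back-propagated gradients, and it does not touch the noise latent. The function name `fit_scale_shift` and the flattened-list depth representation are assumptions made for illustration, not the paper's code.

```python
from typing import Sequence, Tuple


def fit_scale_shift(
    relative_depth: Sequence[float],
    reference_depth: Sequence[float],
) -> Tuple[float, float]:
    """Return (scale, shift) minimising sum((scale * x + shift - y) ** 2).

    `relative_depth` stands in for a flattened Marigold-style depth map and
    `reference_depth` for a flattened Dust3r-style depth map (hypothetical
    inputs; both are plain sequences of floats here).
    """
    n = len(relative_depth)
    if n != len(reference_depth) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")

    mean_x = sum(relative_depth) / n
    mean_y = sum(reference_depth) / n

    # Closed-form simple linear regression: scale = cov(x, y) / var(x).
    var_x = sum((x - mean_x) ** 2 for x in relative_depth)
    if var_x == 0.0:
        raise ValueError("relative depth is constant; scale is undefined")
    cov_xy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(relative_depth, reference_depth)
    )

    scale = cov_xy / var_x
    shift = mean_y - scale * mean_x
    return scale, shift


def apply_scale_shift(
    relative_depth: Sequence[float], scale: float, shift: float
) -> list:
    """Map relative depth values into the reference depth's frame."""
    return [scale * x + shift for x in relative_depth]
```

Calling `fit_scale_shift(marigold_values, dust3r_values)` and then `apply_scale_shift` gives a depth map in the reference scale. In the scheme the caption describes, the same objective is instead minimised jointly over $\lambda$, $\mu$, and the noise latent, so this closed-form fit is at most a plausible initialisation for the two scalar parameters.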
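
The mask construction in the Figure 4 caption (initialise from neighbouring instance masks, drop pixels farther from the camera than the instance, then extend the mask out of frame where the instance touches the image boundary) can be sketched roughly as follows. Several details are assumptions made for illustration rather than taken from the paper: the instance's depth is summarised by its median, pixels at equal depth are kept, the out-of-frame extension is a fixed-width strip controlled by a hypothetical `pad` parameter, and masks are plain nested lists of booleans.

```python
from statistics import median
from typing import List, Sequence

Mask = List[List[bool]]


def build_inpainting_mask(
    instance_mask: Sequence[Sequence[bool]],
    neighbour_masks: Sequence[Sequence[Sequence[bool]]],
    depth: Sequence[Sequence[float]],
    pad: int = 2,
) -> Mask:
    """Return a (possibly padded) mask of pixels to inpaint for one instance."""
    h, w = len(instance_mask), len(instance_mask[0])

    # Reference depth of the instance (assumption: median over its pixels).
    inst_depths = [
        depth[r][c] for r in range(h) for c in range(w) if instance_mask[r][c]
    ]
    if not inst_depths:
        raise ValueError("instance mask is empty")
    ref_depth = median(inst_depths)

    # Steps 1 and 2: union of neighbouring masks, keeping only pixels that
    # are not farther from the camera than the instance (possible occluders).
    inner = [
        [
            any(m[r][c] for m in neighbour_masks) and depth[r][c] <= ref_depth
            for c in range(w)
        ]
        for r in range(h)
    ]

    # Step 3: boundary-touching check, one flag per image side.
    top = pad if any(instance_mask[0]) else 0
    bottom = pad if any(instance_mask[h - 1]) else 0
    left = pad if any(row[0] for row in instance_mask) else 0
    right = pad if any(row[w - 1] for row in instance_mask) else 0

    # Padded canvas: out-of-frame strips are marked for inpainting, and the
    # in-frame part copies the depth-filtered neighbour mask.
    out = [[True] * (w + left + right) for _ in range(h + top + bottom)]
    for r in range(h):
        for c in range(w):
            out[r + top][c + left] = inner[r][c]
    return out
```

The returned mask lives on a canvas that is larger than the input image whenever the instance touches a border, so the image passed to the inpainting model would need the same padding. How the paper sizes that extension is not stated in the caption, which is why `pad` is left as a free parameter here.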