Decompositional Neural Scene Reconstruction with Generative Diffusion Prior

Junfeng Ni, Yu Liu, Ruijie Lu, Zirui Zhou, Song-Chun Zhu, Yixin Chen, Siyuan Huang

TL;DR

DP-Recon presents a decompositional neural scene reconstruction framework that integrates per-object diffusion priors via Score Distillation Sampling to fill underconstrained regions in sparse-view scenes. A novel visibility-guided SDS scheme dynamically adjusts per-pixel loss weights, balancing fidelity to input views with generative completion, thereby improving both geometry and appearance across objects and background. The method yields decomposed object meshes with detailed UV maps suitable for photorealistic rendering and VFX editing, and demonstrates strong improvements on Replica and ScanNet++ with as few as 5–10 views, even outperforming baselines using many more views. Overall, DP-Recon advances 3D reconstruction with object-centric diffusion priors, enabling reliable editing and downstream applications while maintaining fidelity to observed data.

Abstract

Decompositional reconstruction of 3D scenes, with complete shapes and detailed texture of all objects within, is intriguing for downstream applications but remains challenging, particularly with sparse views as input. Recent approaches incorporate semantic or geometric regularization to address this issue, but they suffer significant degradation in underconstrained areas and fail to recover occluded regions. We argue that the key to solving this problem lies in supplementing missing information for these areas. To this end, we propose DP-Recon, which employs diffusion priors in the form of Score Distillation Sampling (SDS) to optimize the neural representation of each individual object under novel views. This provides additional information for the underconstrained areas, but directly incorporating the diffusion prior raises potential conflicts between the reconstruction and generative guidance. Therefore, we further introduce a visibility-guided approach to dynamically adjust the per-pixel SDS loss weights. Together, these components enhance both geometry and appearance recovery while remaining faithful to input images. Extensive experiments across Replica and ScanNet++ demonstrate that our method significantly outperforms SOTA methods. Notably, it achieves better object reconstruction under 10 views than the baselines under 100 views. Our method enables seamless text-based editing for geometry and appearance through SDS optimization and produces decomposed object meshes with detailed UV maps that support photorealistic visual effects (VFX) editing. The project page is available at https://dp-recon.github.io/.
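
To make the idea of visibility-guided per-pixel SDS weighting concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a per-pixel visibility map in [0, 1] (1 meaning the pixel is well covered by the input views) and uses a simple linear complement, 1 - visibility, as the weight; the abstract does not specify the actual weighting function, so that choice, the `floor` parameter, and the function names `visibility_weights` and `weighted_sds_residual` are all illustrative assumptions. The residual (predicted noise minus injected noise) is the standard per-pixel SDS guidance term.

```python
import numpy as np


def visibility_weights(visibility, floor=0.0):
    """Map per-pixel visibility in [0, 1] to an SDS weight in [floor, 1].

    Assumption: a linear complement, so well-observed pixels receive little
    generative guidance and unobserved pixels receive the most.
    """
    v = np.clip(np.asarray(visibility, dtype=np.float64), 0.0, 1.0)
    return floor + (1.0 - floor) * (1.0 - v)


def weighted_sds_residual(noise_pred, noise, visibility, floor=0.0):
    """Scale the per-pixel SDS residual by the visibility-derived weight.

    noise_pred, noise: arrays of shape (H, W, C)
    visibility:        array of shape (H, W)
    """
    w = visibility_weights(visibility, floor)
    residual = np.asarray(noise_pred) - np.asarray(noise)
    # Broadcast the (H, W) weight over the channel axis.
    return w[..., None] * residual


# Toy example: a 2x2 image where the diffusion model's noise estimate is
# off by exactly 1.0 at every pixel and channel.
rng = np.random.default_rng(0)
noise = rng.standard_normal((2, 2, 3))
noise_pred = noise + 1.0
visibility = np.array([[1.0, 0.5],
                       [0.0, 0.25]])

grad = weighted_sds_residual(noise_pred, noise, visibility)
```

In this toy setup the fully visible pixel receives no generative guidance, while the fully unobserved pixel receives the full residual, which is the balance between reconstruction fidelity and generative completion that the abstract describes. A nonzero `floor` would keep a small amount of prior guidance even on observed pixels.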

Paper Structure

This paper contains 77 sections, 31 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: We propose DP-Recon, which capitalizes on pre-trained diffusion models for complete and decompositional neural scene reconstruction. This approach significantly improves reconstruction quality in less captured regions, where previous methods often struggle. Additionally, our method enables flexible text-based editing of geometry and appearance, as well as photorealistic VFX editing.
  • Figure 2: Overview of DP-Recon. We first use reconstruction loss $\mathcal{L}_{\text{recon}}$ for decompositional neural reconstruction, followed by the prior-guided geometry optimization stage that incorporates SDS loss $\mathcal{L}_{\text{SDS}}^{g-v}$. We finally export the object meshes and optimize their appearance with $\mathcal{L}_{\text{SDS}}^{a-v}$. The visibility balances the guidance from prior and reconstruction by dynamically adjusting per-pixel SDS loss.
  • Figure 3: Qualitative comparison of 10-view reconstruction. We present examples from ScanNet++ [yeshwanthliu2023scannetpp] and Replica [replica19arxiv]. In each example, the first row shows the background, the second the full scene, and the third individual objects. We reconstruct more complete and reasonable 3D geometry, especially in less captured and occluded regions, such as the chair behind the table and the background.
  • Figure 4: Qualitative results of novel view synthesis. Our method significantly improves rendering quality, particularly in less captured regions with low visibility, shown in darker colors in the visibility maps, such as the highlighted corner of the wall.
  • Figure 5: Visualized novel view instance masks. Our method can synthesize consistent and complete novel view instance masks.
  • ...and 8 more figures