Uncertainty-Aware Diffusion Guided Refinement of 3D Scenes

Sarosij Bose, Arindam Dutta, Sayak Nag, Junge Zhang, Jiachen Li, Konstantinos Karydis, Amit K. Roy Chowdhury

TL;DR

This work tackles the ill-posed problem of reconstructing 3D scenes from a single image by refining coarse Gaussians using a camera-controlled Latent Video Diffusion Model (LVDM) to generate pose-consistent pseudo-views. An uncertainty-aware mechanism, driven by MLLM-guided open-vocabulary segmentation, yields per-pixel entropy maps that weight refinement toward trustworthy regions, while Fourier Style Transfer (FST) aligns textures between real and generated views. The refinement uses Adaptive Densification and Pruning (ADP) to manage Gaussian density and an uncertainty-weighted reconstruction loss to update Gaussian parameters, producing more realistic and multi-view-consistent novel views. Experiments on RealEstate-10K and KITTI-v2 show consistent improvements over state-of-the-art feed-forward methods in both interpolation and extrapolation tasks, validating the approach on in-domain and out-of-domain data without ground-truth supervision.
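The per-pixel entropy map mentioned above can be illustrated with a minimal NumPy sketch. It assumes the segmentation model exposes per-pixel class logits of shape (H, W, C); the function names `softmax` and `entropy_uncertainty`, and the normalization by log C, are illustrative choices and not taken from the paper.

```python
import numpy as np

def softmax(logits: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax over the class axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def entropy_uncertainty(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Shannon entropy per pixel, normalized to [0, 1].

    probs: (H, W, C) class probabilities that sum to 1 over the last axis.
    Dividing by log(C) is an assumption made here so that 1.0 means
    "maximally uncertain" regardless of the number of classes.
    """
    p = np.clip(probs, eps, 1.0)          # avoid log(0)
    entropy = -(p * np.log(p)).sum(axis=-1)
    return entropy / np.log(probs.shape[-1])

# Hypothetical logits standing in for a segmentation model's output.
logits = np.random.default_rng(0).normal(size=(4, 4, 5))
uncertainty = entropy_uncertainty(softmax(logits))  # shape (4, 4)
```

A pixel whose class distribution is nearly one-hot gets an uncertainty close to 0, while a pixel with a flat distribution gets a value close to 1.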

Abstract

Reconstructing 3D scenes from a single image is a fundamentally ill-posed task due to the severely under-constrained nature of the problem. Consequently, when the scene is rendered from novel camera views, existing single-image-to-3D reconstruction methods produce incoherent and blurry views. This problem is exacerbated when the unseen regions are far away from the input camera. In this work, we address these inherent limitations in existing single-image-to-3D scene feed-forward networks. To alleviate the poor performance due to insufficient information beyond the input image's view, we leverage a strong generative prior in the form of a pre-trained latent video diffusion model for iterative refinement of a coarse scene represented by optimizable Gaussian parameters. To ensure that the style and texture of the generated images align with those of the input image, we incorporate on-the-fly Fourier style transfer between the generated images and the input image. Additionally, we design a semantic uncertainty quantification module that calculates the per-pixel entropy and yields uncertainty maps, which guide the refinement process toward the most confident pixels while discarding the remaining highly uncertain ones. We conduct extensive experiments on real-world scene datasets, including in-domain RealEstate-10K and out-of-domain KITTI-v2, showing that our approach provides more realistic and higher-fidelity novel view synthesis results than existing state-of-the-art methods.
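One way to read "guide the refinement from the most confident pixels while discarding the highly uncertain ones" is as a masked, weighted reconstruction loss. The sketch below is only an illustration under stated assumptions: the L1 form, the threshold `tau`, the `1 - uncertainty` weighting, and the function names are all hypothetical and are not the paper's exact objective.

```python
import numpy as np

def confidence_weights(uncertainty: np.ndarray, tau: float = 0.5) -> np.ndarray:
    """Turn an (H, W) uncertainty map in [0, 1] into per-pixel weights.

    Pixels with uncertainty above tau are discarded (weight 0); the rest
    are weighted by their confidence 1 - uncertainty. Both choices are
    assumptions made for this sketch.
    """
    return np.where(uncertainty > tau, 0.0, 1.0 - uncertainty)

def weighted_l1(rendered: np.ndarray, pseudo: np.ndarray,
                weights: np.ndarray, eps: float = 1e-8) -> float:
    """Weighted L1 between a rendered view and a pseudo view.

    rendered, pseudo: (H, W, 3) images; weights: (H, W).
    The weights are applied to both images as an element-wise
    (Hadamard) product, then the error is averaged over kept pixels.
    """
    w = weights[..., None]                      # broadcast over channels
    err = np.abs(w * rendered - w * pseudo)
    return float(err.sum() / (w.sum() * rendered.shape[-1] + eps))

# Toy example: errors on uncertain pixels do not contribute to the loss.
uncertainty = np.array([[0.0, 0.2], [0.6, 1.0]])
rendered = np.zeros((2, 2, 3))
pseudo = np.ones((2, 2, 3))
loss = weighted_l1(rendered, pseudo, confidence_weights(uncertainty))
```

In an actual refinement loop the same weighting would be applied inside a differentiable framework so that gradients reach the Gaussian parameters; NumPy is used here only to keep the arithmetic visible.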

Paper Structure

This paper contains 13 sections, 6 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview: We introduce UAR-Scenes, a diffusion-guided refinement pipeline that enhances outputs from pre-trained single-image to 3D scene reconstruction models, such as Flash3D szymanowicz2024flash3d, which yield imperfect renderings (red box) under slight viewpoint variations (2nd column from left). By harnessing the generative power of a latent video diffusion model (LVDM) blattmann2023stable, our approach can sample plausible explanations for unseen regions (green box) to produce refined, high-quality novel views of 3D scenes (3rd column from left).
  • Figure 2: Workflow of UAR-Scenes. For a conditioning image $\mathcal{I}$, a pre-trained 3D reconstruction model $\mathcal{F}(\cdot)$ produces coarse Gaussians ($\gamma_n$) with optimizable parameters that represent the scene $\phi$. Using temporally consistent pseudo 2D supervisory images ($\widetilde{\mathcal{I}}_{\text{p}}$) sampled from the pre-trained, camera-extrinsics-embedded LVDM wang2024motionctrl, we iteratively refine the Gaussians $\gamma_n$ using Adaptive Densification and Pruning (ADP) kerbl20233d to obtain clean Gaussians ($\gamma^{'}_n$). To gauge the uncertainty of each pixel $\textbf{p}$ in the generated pseudo images ($\widetilde{\mathcal{I}}_{\text{p}}$), we propose a semantic uncertainty quantification method: we estimate the entropy of each $\widetilde{\mathcal{I}}_{\text{p}}$ using an off-the-shelf open-vocabulary segmentation model $\mathcal{S}$ li2022languagedriven, from which we obtain uncertainty maps $\mathcal{U}$ (as shown in \ref{subsec:coarse_init}). We take the Hadamard product of $\widetilde{\mathcal{I}}_{\text{p}}$ and $\mathcal{U}$ to form the target objective for the refinement loss in \ref{eq:uw_fst__recon}, which guides the refinement process.
  • Figure 3: Uncertainty Map Estimation. Left: the pseudo ground-truth image $\widetilde{\mathcal{I}}_{\text{p}}$, generated by the LVDM after applying FST yang2020fda. Right: the uncertainty map obtained from it. These maps are crucial for guiding the Gaussian refinement process to progressively focus on the confident regions (blue and green) rather than on the blurry ceiling and stairs (red).
  • Figure 4: Qualitative Results. (a) Qualitative comparisons on the RealEstate-10K dataset show that our method produces more realistic results that are more plausible and faithful to the original image. In the 1st row, Flash3D's renderings are blurry outside the camera's seen frustum, whereas UAR-Scenes is able to complete the window. Similarly, UAR-Scenes can provide reasonable completions that may not always align with the GT, as shown in the 2nd row. (b) Qualitative comparison on the KITTI-v2 dataset shows that our method delivers sharp results, especially at edges where there may be ambiguity (the back edge of the car is distorted in Flash3D's prediction in the 2nd column). Notice that despite the significant camera motion between the original input view and the target novel views, UAR-Scenes produces realistic and plausible renderings, as highlighted above.
  • Figure 5: Ablation Results. The leftmost image is the rendered view from the baseline method Flash3D, which fails in extrapolation. The next image is generated by the LVDM and has clearly oversaturated textures that do not match real-world scenes. In the 3rd image from the left, FST alleviates this issue by performing style alignment, which leads to a better-quality final output (rightmost).
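The style alignment that FST performs in Figures 3 and 5 can be sketched in the spirit of Fourier domain adaptation (yang2020fda): keep the phase of the generated image and swap its low-frequency amplitude for that of the input image. The version below is a minimal NumPy sketch under that assumption; the function name, the window parameter `beta`, and its default value are illustrative, not the authors' settings.

```python
import numpy as np

def fourier_style_transfer(generated: np.ndarray, reference: np.ndarray,
                           beta: float = 0.05) -> np.ndarray:
    """Replace the low-frequency amplitude of `generated` with that of `reference`.

    generated, reference: float arrays of identical shape (H, W, C).
    beta: fraction of min(H, W) setting the half-width of the swapped
          low-frequency window (assumed value for this sketch).
    """
    fft_gen = np.fft.fft2(generated, axes=(0, 1))
    fft_ref = np.fft.fft2(reference, axes=(0, 1))

    amp_gen, phase_gen = np.abs(fft_gen), np.angle(fft_gen)
    amp_ref = np.abs(fft_ref)

    # Move the zero-frequency component to the center of the spectrum.
    amp_gen = np.fft.fftshift(amp_gen, axes=(0, 1))
    amp_ref = np.fft.fftshift(amp_ref, axes=(0, 1))

    h, w = generated.shape[:2]
    b = int(np.floor(min(h, w) * beta))
    ch, cw = h // 2, w // 2
    # Swap a centered square window of low-frequency amplitudes.
    amp_gen[ch - b:ch + b + 1, cw - b:cw + b + 1] = \
        amp_ref[ch - b:ch + b + 1, cw - b:cw + b + 1]

    amp_gen = np.fft.ifftshift(amp_gen, axes=(0, 1))
    stylized = np.fft.ifft2(amp_gen * np.exp(1j * phase_gen), axes=(0, 1))
    return np.real(stylized)

# Toy example with random images standing in for an LVDM frame and the input view.
rng = np.random.default_rng(0)
gen = rng.uniform(0.0, 1.0, size=(16, 16, 3))
ref = rng.uniform(0.0, 1.0, size=(16, 16, 3))
out = fourier_style_transfer(gen, ref, beta=0.1)
```

Because low frequencies carry global color and illumination while the phase carries structure, the output keeps the generated content but adopts the reference's overall tone. In practice the result would typically be clipped back to the valid intensity range.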