Gaussian Scenes: Pose-Free Sparse-View Scene Reconstruction using Depth-Enhanced Diffusion Priors
Soumava Paul, Prakhar Kaushik, Alan Yuille
TL;DR
This work addresses reconstructing complete 360° 3D scenes from sparse, uncalibrated images by combining initial geometry from MASt3R and 3DGS with an RGBD diffusion prior. The core approach uses a VAE-UNet diffusion framework conditioned on context (CLIP features), geometry (Plücker coordinates), and a pixel-aligned confidence map, with FiLM-based conditioning to enforce multi-view coherence. Key contributions include the diffusion-based RGBD inpainting pipeline, a robust confidence-guided conditioning mechanism, and a depth-aware autoencoder fine-tuning strategy, enabling competitive pose-free reconstruction on the MipNeRF360 and DL3DV-10K benchmarks. The method reduces dependency on large-scale multi-view data and demonstrates practical potential for pose-free 360° scene reconstruction with moderate compute requirements.
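As an illustration of the geometry conditioning mentioned above, the following is a minimal NumPy sketch of a per-pixel Plücker ray map for a pinhole camera. It is not taken from the paper: the function name `plucker_rays`, the (direction, moment) channel ordering, the pixel-center convention, and the example intrinsics are all assumptions made for illustration.

```python
import numpy as np

def plucker_rays(K, c2w, height, width):
    """Return an (H, W, 6) map of Plücker coordinates (d, o x d) per pixel.

    K   : (3, 3) pinhole intrinsics (assumed convention).
    c2w : (4, 4) camera-to-world transform (assumed convention).
    """
    # Pixel centers in image coordinates (u along columns, v along rows).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)           # (H, W, 3)

    # Back-project to camera-frame directions, then rotate into the world frame.
    dirs_cam = pix @ np.linalg.inv(K).T
    dirs_world = dirs_cam @ c2w[:3, :3].T
    dirs_world = dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # All rays share the camera center as origin; the moment is o x d.
    origin = np.broadcast_to(c2w[:3, 3], dirs_world.shape)
    moment = np.cross(origin, dirs_world)
    return np.concatenate([dirs_world, moment], axis=-1)       # (H, W, 6)

# Hypothetical camera: 64x48 image, focal length 100 px, translated center.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
c2w = np.eye(4)
c2w[:3, 3] = [1.0, 2.0, 3.0]
rays = plucker_rays(K, c2w, height=48, width=64)
```

The resulting six-channel map is pixel-aligned with the image, which is what makes it convenient as a dense conditioning signal alongside a rendered view.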
Abstract
In this work, we introduce a generative approach for pose-free (without camera parameters) reconstruction of 360° scenes from a sparse set of 2D images. Scene reconstruction from incomplete, unposed observations is usually regularized with depth estimation or 3D foundational priors. While recent advances have enabled sparse-view reconstruction of large, complex scenes (with a high degree of foreground and background detail) from known camera poses using view-conditioned generative priors, these methods cannot be directly adapted to the pose-free setting, where ground-truth poses are not available during evaluation. To address this, we propose an image-to-image generative model designed to inpaint missing details and remove artifacts in novel-view renders and depth maps of a 3D scene. We introduce context and geometry conditioning using Feature-wise Linear Modulation (FiLM) layers as a lightweight alternative to cross-attention, and we propose a novel confidence measure for 3D Gaussian splat representations to better detect these artifacts. By progressively integrating these novel views in a Gaussian-SLAM-inspired process, we achieve a multi-view-consistent 3D representation. Evaluations on the MipNeRF360 and DL3DV-10K benchmark datasets demonstrate that our method surpasses existing pose-free techniques and performs competitively with state-of-the-art posed reconstruction methods (those given precomputed camera parameters) on complex 360° scenes. Our project page provides additional results, videos, and code.
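To make the FiLM conditioning mentioned in the abstract concrete, here is a minimal NumPy sketch of a single FiLM layer: a conditioning vector is mapped to a per-channel scale and shift that modulate a feature map. This is a generic illustration of FiLM, not the paper's implementation; the function name `film`, the single linear projection per parameter, and all shapes are assumptions.

```python
import numpy as np

def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Apply Feature-wise Linear Modulation to a feature map.

    features : (C, H, W) feature map (e.g. from a UNet block).
    cond     : (D,) conditioning vector (e.g. a pooled context embedding).
    w_*, b_* : linear projections from D to C (hypothetical parameters).
    """
    gamma = cond @ w_gamma + b_gamma   # (C,) per-channel scale
    beta = cond @ w_beta + b_beta      # (C,) per-channel shift
    # Broadcast scale and shift over the spatial dimensions.
    return gamma[:, None, None] * features + beta[:, None, None]

rng = np.random.default_rng(0)
C, H, W, D = 4, 8, 8, 16
features = rng.standard_normal((C, H, W))
cond = rng.standard_normal(D)

# Identity initialization: gamma = 1 and beta = 0 regardless of cond.
w_gamma, b_gamma = np.zeros((D, C)), np.ones(C)
w_beta, b_beta = np.zeros((D, C)), np.zeros(C)
out_identity = film(features, cond, w_gamma, b_gamma, w_beta, b_beta)

# Random projections: the output now depends on the conditioning vector.
w_gamma_r = 0.1 * rng.standard_normal((D, C))
w_beta_r = 0.1 * rng.standard_normal((D, C))
out_modulated = film(features, cond, w_gamma_r, b_gamma, w_beta_r, b_beta)
```

Unlike cross-attention, whose cost grows with the number of spatial positions times the number of conditioning tokens, this modulation needs only two small projections and an elementwise affine transform per layer.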
