Gaussian Scenes: Pose-Free Sparse-View Scene Reconstruction using Depth-Enhanced Diffusion Priors

Soumava Paul, Prakhar Kaushik, Alan Yuille

TL;DR

This work addresses reconstructing complete $360^{\circ}$ 3D scenes from sparse, uncalibrated images by combining MASt3R/3DGS initial geometry with an RGBD diffusion prior. The core approach uses a VAE-UNet diffusion framework conditioned on context (CLIP features), geometry (Plücker coordinates), and a pixel-aligned confidence map, with FiLM-based conditioning to enforce multi-view coherence. Key contributions include the diffusion-based RGBD inpainting pipeline, a robust confidence-guided conditioning mechanism, and a depth-aware autoencoder fine-tuning strategy, enabling competitive pose-free reconstruction on the MipNeRF360 and DL3DV-10K benchmarks. The method reduces dependency on large-scale multi-view data and demonstrates practical potential for pose-free $360^{\circ}$ scene reconstruction with moderate compute requirements.
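As an illustration of the geometry conditioning mentioned above, the following is a minimal NumPy sketch of a per-pixel Plücker ray embedding. It is not the authors' implementation: the function name `plucker_rays`, the `(direction, moment)` channel ordering, the pinhole intrinsics `K`, and the camera-to-world matrix `c2w` are all assumptions chosen for the example.

```python
import numpy as np

def plucker_rays(K, c2w, height, width):
    """Return an (H, W, 6) Plücker embedding (d, o x d) for every pixel.

    K:   (3, 3) pinhole intrinsics (hypothetical values in the example below).
    c2w: (4, 4) camera-to-world transform (hypothetical pose).
    """
    # Pixel-centre coordinates in homogeneous form, shape (H, W, 3).
    v, u = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    pix = np.stack([u + 0.5, v + 0.5, np.ones_like(u, dtype=float)], axis=-1)

    # Back-project to camera-space directions, then rotate into world space.
    dirs_cam = pix @ np.linalg.inv(K).T
    dirs_world = dirs_cam @ c2w[:3, :3].T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # The moment o x d encodes where the ray sits in space.
    origin = np.broadcast_to(c2w[:3, 3], dirs_world.shape)
    moment = np.cross(origin, dirs_world)
    return np.concatenate([dirs_world, moment], axis=-1)

# Illustrative camera: 32x32 image, focal length 50 px, translated origin.
K = np.array([[50.0, 0.0, 16.0], [0.0, 50.0, 16.0], [0.0, 0.0, 1.0]])
c2w = np.eye(4)
c2w[:3, 3] = [1.0, 2.0, 3.0]
rays = plucker_rays(K, c2w, height=32, width=32)
print(rays.shape)  # (32, 32, 6)
```

The resulting six-channel map is pixel-aligned with the image, which is what makes it convenient as a dense conditioning signal; how the paper resizes or encodes it before it reaches the UNet is not stated in this summary.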

Abstract

In this work, we introduce a generative approach for pose-free (without camera parameters) reconstruction of $360^{\circ}$ scenes from a sparse set of 2D images. Scene reconstruction from incomplete, pose-free observations is usually regularized with depth estimation or 3D foundational priors. While recent advances have enabled sparse-view reconstruction of large, complex scenes (with a high degree of foreground and background detail) from known camera poses using view-conditioned generative priors, these methods cannot be directly adapted to the pose-free setting, where ground-truth poses are not available during evaluation. To address this, we propose an image-to-image generative model designed to inpaint missing details and remove artifacts in novel view renders and depth maps of a 3D scene. We introduce context and geometry conditioning using Feature-wise Linear Modulation (FiLM) layers as a lightweight alternative to cross-attention, and we also propose a novel confidence measure for 3D Gaussian splat representations to allow for better detection of these artifacts. By progressively integrating these novel views in a Gaussian-SLAM-inspired process, we achieve a multi-view-consistent 3D representation. Evaluations on the MipNeRF360 and DL3DV-10K benchmark datasets demonstrate that our method surpasses existing pose-free techniques and performs competitively with state-of-the-art posed (precomputed camera parameters are given) reconstruction methods in complex $360^{\circ}$ scenes. Our project page provides additional results, videos, and code.
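To make the FiLM conditioning mentioned in the abstract concrete, here is a minimal NumPy sketch of a single FiLM layer: a conditioning vector is mapped to a per-channel scale and shift that modulate a feature map. This is a generic illustration rather than the paper's architecture; the function name `film`, the linear projections, and all shapes are assumptions made for the example.

```python
import numpy as np

def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Apply Feature-wise Linear Modulation to a feature map.

    features: (C, H, W) feature map, e.g. from one UNet block.
    cond:     (D,) conditioning vector (hypothetical pooled embedding).
    w_*, b_*: linear projection parameters with shapes (C, D) and (C,).
    """
    gamma = w_gamma @ cond + b_gamma  # per-channel scale, shape (C,)
    beta = w_beta @ cond + b_beta     # per-channel shift, shape (C,)
    # Broadcast the scale and shift over the spatial dimensions.
    return gamma[:, None, None] * features + beta[:, None, None]

rng = np.random.default_rng(0)
C, H, W, D = 4, 8, 8, 16
features = rng.standard_normal((C, H, W))
cond = rng.standard_normal(D)

# Identity initialisation (scale 1, shift 0): the layer starts as a no-op,
# a common choice when adding conditioning to a pretrained network.
w_gamma, b_gamma = np.zeros((C, D)), np.ones(C)
w_beta, b_beta = np.zeros((C, D)), np.zeros(C)
out = film(features, cond, w_gamma, b_gamma, w_beta, b_beta)
print(out.shape)  # (4, 8, 8)
```

Unlike cross-attention, whose cost grows with the number of spatial positions times the number of conditioning tokens, this modulation needs only two small projections per layer, which is consistent with the abstract's description of FiLM as a lightweight alternative.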

Paper Structure

This paper contains 40 sections, 6 equations, 17 figures, 5 tables, 1 algorithm.

Figures (17)

  • Figure 1: Given sparse pose-free images as input, GScenes reconstructs a 3D scene in 5 minutes by iteratively fusing novel view renders and depth maps with an underlying 3D Gaussian representation. Typical pose-free baselines built with geometric priors struggle with reconstructing $360^{\circ}$ scenes from sparse inputs due to the absence of generative priors. GScenes comprises a latent diffusion model capable of inpainting missing details and removing Gaussian artifacts in novel view renders, thereby enabling generation of full $360^{\circ}$ scenes.
  • Figure 2: Comparison of sparse-view reconstruction methods. Methods are grouped based on their requirement for accurate camera poses, open-source availability, need for generative priors, and applicability to large-scale scene reconstruction.
  • Figure 3: Overview of GScenes. We render 3D Gaussians fitted to our sparse set of $M$ views from a novel viewpoint. The resulting render and depth map have missing regions and Gaussian artifacts, which are rectified by an RGBD image-to-image diffusion model. This then acts as pseudo ground truth to spawn and update 3D Gaussians and satisfy the new view constraints. This process is repeated for several novel views spanning the $360^{\circ}$ scene until the representation becomes multi-view consistent.
  • Figure 4: Camera trajectory visualization for novel view synthesis in the pose-free sparse-view setting.
  • Figure 5: Incorporating context and geometry conditioning in the up blocks of the UNet negatively impacts latent and subsequent image reconstruction.
  • ...and 12 more figures