SparseGS-W: Sparse-View 3D Gaussian Splatting in the Wild with Generative Priors

Yiqing Li, Xuan Wang, Jiawei Wu, Yikun Ma, Zhi Jin

TL;DR

SparseGS-W tackles few-shot novel view synthesis for unconstrained in-the-wild outdoor scenes by combining 3D Gaussian Splatting with constrained diffusion priors. The method initializes from dense geometric priors, then uses Constrained Novel-View Enhancement to iteratively refine rendered novel views and Occlusion Handling to remove transient occluders, with appearance controlled via AdaIN from a reference image. A Progressive Sampling and Training Strategy stabilizes optimization under sparse data, with losses that use pseudo ground truths produced by diffusion-based enhancement. Experiments on PhotoTourism and Tanks and Temples show state-of-the-art performance on both full-reference and non-reference metrics, with robust reconstruction and occlusion removal in few-shot, real-world scenarios.
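For readers unfamiliar with AdaIN (adaptive instance normalization), the following is a minimal sketch of the standard operation: each content channel is shifted and scaled so its mean and standard deviation match those of the corresponding reference channel. This summary does not say which features the method applies AdaIN to, so the function names, the flattened-list layout, and the toy values are all illustrative assumptions, not the authors' implementation.

```python
from statistics import fmean, pstdev

def adain_channel(content, style, eps=1e-5):
    """Re-normalize one content channel to the mean/std of one style channel.

    Standard AdaIN: sigma(style) * (x - mu(content)) / sigma(content) + mu(style).
    `eps` guards against division by zero for constant channels.
    """
    mu_c, sigma_c = fmean(content), pstdev(content)
    mu_s, sigma_s = fmean(style), pstdev(style)
    scale = sigma_s / (sigma_c + eps)
    return [scale * (x - mu_c) + mu_s for x in content]

def adain(content_feats, style_feats, eps=1e-5):
    """Apply AdaIN channel by channel.

    Each argument is a list of channels; each channel is a flat list of floats
    (a hypothetical layout chosen to keep the sketch dependency-free).
    """
    return [adain_channel(c, s, eps) for c, s in zip(content_feats, style_feats)]

# Toy example: one channel from a rendered view, one from a reference image.
rendered = [[0.0, 1.0, 2.0, 3.0]]
reference = [[10.0, 10.0, 14.0, 14.0]]
out = adain(rendered, reference)
```

After the call, `out[0]` keeps the relative ordering of the rendered values but carries the reference channel's statistics, which is why the operation is commonly used to transfer global appearance such as illumination or color tone.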

Abstract

Synthesizing novel views of large-scale scenes from unconstrained in-the-wild images is an important but challenging task in computer vision. Existing methods, which optimize per-image appearance and transient occlusion through implicit neural networks from dense training views (approximately 1000 images), struggle to perform effectively under sparse input conditions, resulting in noticeable artifacts. To this end, we propose SparseGS-W, a novel framework based on 3D Gaussian Splatting that enables the reconstruction of complex outdoor scenes and handles occlusions and appearance changes with as few as five training images. We leverage geometric priors and constrained diffusion priors to compensate for the lack of multi-view information from extremely sparse input. Specifically, we propose a plug-and-play Constrained Novel-View Enhancement module to iteratively improve the quality of rendered novel views during the Gaussian optimization process. Furthermore, we propose an Occlusion Handling module, which flexibly removes occlusions utilizing the inherent high-quality inpainting capability of constrained diffusion priors. Both modules are capable of extracting appearance features from any user-provided reference image, enabling flexible modeling of illumination-consistent scenes. Extensive experiments on the PhotoTourism and Tanks and Temples datasets demonstrate that SparseGS-W achieves state-of-the-art performance not only in full-reference metrics, but also in commonly used non-reference metrics such as FID, ClipIQA, and MUSIQ.
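The abstract distinguishes full-reference metrics, which compare a rendered view against a ground-truth image, from non-reference metrics such as FID, ClipIQA, and MUSIQ, which score image quality without a paired target. As a hedged illustration of the former, the sketch below computes PSNR, a widely used full-reference metric. The abstract does not name the full-reference metrics used, so choosing PSNR here is an assumption, as are the function name and the flat-list pixel layout.

```python
import math

def psnr(rendered, ground_truth, max_value=1.0):
    """Peak signal-to-noise ratio (dB) between two equally sized pixel lists.

    Pixels are assumed to lie in [0, max_value]; higher PSNR means the
    rendering is closer to the ground truth.
    """
    if len(rendered) != len(ground_truth):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - g) ** 2 for r, g in zip(rendered, ground_truth)) / len(rendered)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy example: every pixel is off by 0.1, so the MSE is 0.01.
score = psnr([0.5, 0.5], [0.4, 0.6])
```

A non-reference metric has no `ground_truth` argument at all, which is what makes it usable when held-out views are unavailable or affected by transient occluders.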

Paper Structure

This paper contains 20 sections, 14 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Given only 5 tourism images captured from different views and times, either from a user's photo album or the Internet (a), our method is able to reconstruct the landscape with variable appearances and remove transient occlusions (b). Our method outperforms previous state-of-the-art methods GS-W [zhang2024GS-W] and WildGaussians [kulhanek2024wildgaussians] (c).
  • Figure 2: An overview of the proposed SparseGS-W framework. Given unconstrained sparse images and a user prompt, we perform dense initialization to obtain the initial point cloud, camera parameters, and occlusion masks. We then leverage constrained diffusion priors to iteratively enhance the novel views rendered from the Gaussian radiance field and remove transient occluders.
  • Figure 3: Visualization results of the CNVE module. Fine-tuning on the training views lets CNVE generate high-quality images, but it struggles to preserve local image structure (see the first pillar on the left in the zoomed-in region). Injecting self-attention features helps maintain structure but cannot effectively remove artifacts. Combining the two strategies lets CNVE restore high-fidelity, view-consistent novel views.
  • Figure 4: Qualitative comparison on the PhotoTourism dataset. Under sparse-view conditions, SparseGS-W reconstructs more realistic and detailed scenes with fewer artifacts and less blurring.
  • Figure 5: Qualitative comparison on the Tanks and Temples dataset. Our method outperforms other baselines in the task of few-shot NVS.
  • ...and 2 more figures