MS-GS: Multi-Appearance Sparse-View 3D Gaussian Splatting in the Wild

Deming Li, Kaiwen Jiang, Yutao Tang, Ravi Ramamoorthi, Rama Chellappa, Cheng Peng

TL;DR

MS-GS addresses multi-appearance 3D reconstruction from sparse imagery with two components: Semantic Depth Alignment, which produces a dense, semantically informed initialization, and a set of geometry-guided supervisions at virtual views. It models per-image appearance and per-Gaussian features, and enforces 3D consistency through pixel- and feature-level losses across virtual views to suppress overfitting. Experiments on three real-world sparse datasets show that MS-GS achieves state-of-the-art perceptual quality and detailed, coherent renderings under appearance changes. The approach enables robust, photorealistic 3D reconstruction in the wild from limited viewpoints and varying appearances while maintaining computational efficiency.

Abstract

In-the-wild photo collections often contain limited volumes of imagery and exhibit multiple appearances, e.g., taken at different times of day or seasons, posing significant challenges to scene reconstruction and novel view synthesis. Although recent adaptations of Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3DGS) have made progress in these areas, they tend to oversmooth and are prone to overfitting. In this paper, we present MS-GS, a novel framework designed with Multi-appearance capabilities in Sparse-view scenarios using 3DGS. To address the lack of support due to sparse initializations, our approach is built on the geometric priors elicited from monocular depth estimates. The key lies in extracting and utilizing local semantic regions with a Structure-from-Motion (SfM) point-anchored algorithm for reliable alignment and geometry cues. Then, to introduce multi-view constraints, we propose a series of geometry-guided supervision steps at virtual views at the pixel and feature levels to encourage 3D consistency and reduce overfitting. We also introduce a dataset and an in-the-wild experiment setting to set up more realistic benchmarks. We demonstrate that MS-GS achieves photorealistic renderings under various challenging sparse-view and multi-appearance conditions, and outperforms existing approaches significantly across different datasets.

Paper Structure

This paper contains 39 sections, 9 equations, 13 figures, 9 tables, 1 algorithm.

Figures (13)

  • Figure 1: With 20 input views, DNGS and FSGS produce overly smooth rendering in regions lacking support from sparse point cloud initialization. For scenes with multiple appearances and sparse inputs, methods like GS-W and Wild-GS experience large artifacts at novel views. In contrast, our method in Fig. \ref{ours1} and \ref{ours2} renders details and provides a coherent reconstruction.
  • Figure 2: Overview of our depth prior initialization of MS-GS. Semantic masks and corresponding SfM point depth within each mask are obtained through our SfM-prompted Semantic module, detailed in Section \ref{sec:method}. We then align monocular depth to SfM depth for each mask by computing the optimal scale $s^*_M$ and shift $t^*_M$. The point cloud is obtained from the back-projection of aligned depths and corresponding image pixel values to construct 3DGS initialization.
  • Figure 3: Overview of our multi-view geometry-guided supervision of MS-GS. Initialized from our proposed dense point cloud, we first create virtual views between training cameras. A 3D point cloud is back-projected given a training view $I_T$ and its corresponding rendered depth $D_T$, and then forward-projected onto the virtual view to obtain the warped image $I^*_V$ for a pixel loss. The correspondences from $I_T$ to $I^*_V$ are mapped to feature maps extracted from these two images to form a feature loss.
  • Figure 4: Novel view synthesis results when components are added sequentially. Please zoom in if possible for better visualization.
  • Figure 5: Qualitative comparison of novel view synthesis across different datasets. MS-GS (ours) excels at capturing detailed structures and preserving consistent appearance.
  • ...and 8 more figures
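
The captions for Figures 2 and 3 describe two geometric steps: fitting a per-mask scale and shift that aligns monocular depth to SfM depth, and back-projecting pixels with depth before projecting them into a virtual view. The sketch below illustrates both under stated assumptions and is not the authors' implementation. It assumes the optimal $s^*_M$ and $t^*_M$ are ordinary least-squares solutions, a pinhole camera with intrinsics `fx, fy, cx, cy`, and a virtual camera that differs from the training camera by a pure translation (no rotation). All function names are hypothetical.

```python
def fit_scale_shift(mono, sfm):
    """Closed-form least squares: find s, t minimizing sum((s*m + t - z)**2).

    mono: monocular depths sampled at SfM points inside one mask
    sfm:  SfM depths at the same points
    """
    n = len(mono)
    if n < 2 or n != len(sfm):
        raise ValueError("need at least two paired depth samples")
    mean_m = sum(mono) / n
    mean_z = sum(sfm) / n
    var_m = sum((m - mean_m) ** 2 for m in mono)
    if var_m == 0.0:
        raise ValueError("constant monocular depth; scale is undetermined")
    cov = sum((m - mean_m) * (z - mean_z) for m, z in zip(mono, sfm))
    s = cov / var_m
    t = mean_z - s * mean_m
    return s, t


def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with depth along the optical axis to a camera-frame point."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def project(point, fx, fy, cx, cy):
    """Project a camera-frame 3D point to pixel coordinates (pinhole model)."""
    x, y, z = point
    if z <= 0.0:
        raise ValueError("point is behind the camera")
    return (fx * x / z + cx, fy * y / z + cy)


def warp_pixel(u, v, depth, intrinsics, virtual_center):
    """Map a training-view pixel into a virtual view.

    Simplifying assumption: the virtual camera shares the training camera's
    orientation and intrinsics; virtual_center is its position expressed in
    the training camera's frame.
    """
    fx, fy, cx, cy = intrinsics
    px, py, pz = backproject(u, v, depth, fx, fy, cx, cy)
    ox, oy, oz = virtual_center
    return project((px - ox, py - oy, pz - oz), fx, fy, cx, cy)


# Usage: align monocular depth inside one mask, then warp one pixel.
s, t = fit_scale_shift([1.0, 2.0, 3.0], [2.5, 4.5, 6.5])
aligned_depth = s * 1.75 + t  # aligned depth for a pixel with monocular depth 1.75
warped = warp_pixel(420.0, 240.0, aligned_depth,
                    (500.0, 500.0, 320.0, 240.0), (0.4, 0.0, 0.0))
```

Fitting one scale and shift per mask, as opposed to one pair per image, lets each semantic region absorb its own local depth error. A full pipeline would also need camera rotation, occlusion handling, and image resampling, none of which are sketched here.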