Sparse View Distractor-Free Gaussian Splatting

Yi Gu, Zhaorui Wang, Jiahang Cao, Jiaxu Wang, Mingle Zhao, Dongjun Ye, Renjing Xu

TL;DR

This work proposes a framework to enhance distractor-free 3DGS under sparse-view conditions by incorporating rich prior information and demonstrates how these priors can be seamlessly integrated into existing distractor-free 3DGS methods.

Abstract

3D Gaussian Splatting (3DGS) enables efficient training and fast novel view synthesis in static environments. To address challenges posed by transient objects, distractor-free 3DGS methods have emerged and shown promising results when dense image captures are available. However, their performance degrades significantly under sparse input conditions. This limitation stems primarily from their reliance on color-residual heuristics to guide training, which become unreliable with limited observations. In this work, we propose a framework to enhance distractor-free 3DGS under sparse-view conditions by incorporating rich prior information. Specifically, we first adopt the geometry foundation model VGGT to estimate camera parameters and generate a dense set of initial 3D points. Then, we harness the attention maps from VGGT for efficient and accurate semantic entity matching. Additionally, we utilize Vision-Language Models (VLMs) to further identify and preserve the large static regions in the scene. We also demonstrate how these priors can be seamlessly integrated into existing distractor-free 3DGS methods. Extensive experiments confirm the effectiveness and robustness of our approach in mitigating transient distractors for sparse-view 3DGS training.

Paper Structure (12 sections, 12 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Illustration of VGGT attention-guided semantic entity matching. Query tokens are highlighted in cyan. Initially, we project these query tokens onto the reference image to obtain the projected tokens. The reprojected tokens are computed in a similar manner. A projected token is considered valid only if its reprojected counterpart lies within the region of the query tokens; otherwise, it is classified as an invalid token (colored in red). To explicitly illustrate the cross-view correspondence matching process, we visualize the global feature maps. Compared to static objects, transient objects typically exhibit lower recall, which serves as a primary criterion for identifying distractors.
  • Figure 2: VLM process illustration. To simplify annotations, we exclude masks containing fewer than 20000 pixels. For the remaining transient candidate masks, we automatically assign a unique identifier to the center of each mask and highlight each mask with a random color. These operations, in combination with our prompts, significantly enhance the generation of mask priors.
  • Figure 3: Qualitative evaluation of baseline methods and our approach on the NeRF On-the-go and RobustNeRF datasets. * means with the VGGT initialization.
  • Figure 4: Quantitative and qualitative evaluation of transient mask generation. Our method consistently outperforms baseline approaches, delivering more reliable and stable mask predictions.
  • Figure 5: Quantitative and qualitative ablations of each component. We present the raw distractor masks generated by different ablation variants to highlight the effectiveness of our mask-prior-guided warm-up strategy.
  • ...and 1 more figures
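
The validity test described in the Figure 1 caption and the mask preparation described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the representation of cross-view matches as integer index arrays (for example, the argmax of an attention map), and the use of the pixel centroid as the mask "center" are all hypothetical. Only the round-trip validity rule and the 20000-pixel threshold come from the captions.

```python
import numpy as np


def cycle_consistent_recall(query_idx, fwd_match, bwd_match):
    """Share of query tokens that survive a project/reproject round trip.

    query_idx : token indices in the source view that belong to one entity.
    fwd_match : fwd_match[i] is the reference-view token matched to source token i
                (assumed here to come from an attention argmax).
    bwd_match : bwd_match[j] is the source-view token matched to reference token j.
    """
    query_idx = np.asarray(query_idx)
    projected = fwd_match[query_idx]      # query tokens -> reference image
    reprojected = bwd_match[projected]    # projected tokens -> back to the source image
    # A projected token is valid only if its reprojection lands inside the query region.
    valid = np.isin(reprojected, query_idx)
    recall = float(valid.mean()) if valid.size else 0.0
    return recall, projected[valid]


def filter_and_label_masks(masks, min_pixels=20000):
    """Drop masks below min_pixels and give each survivor an ID, a center, and a color.

    The center is the pixel centroid (an assumption; for non-convex masks the
    centroid can fall outside the mask).
    """
    rng = np.random.default_rng(0)
    kept = []
    for mask in masks:
        if int(mask.sum()) < min_pixels:
            continue  # exclude masks containing fewer than min_pixels pixels
        ys, xs = np.nonzero(mask)
        center = (int(ys.mean()), int(xs.mean()))
        color = tuple(int(c) for c in rng.integers(0, 256, size=3))
        kept.append({"id": len(kept) + 1, "center": center, "color": color})
    return kept


# Toy example with 6 tokens per view: query tokens 0, 1, 2 map to reference
# tokens 3, 4, 5, and only the first two reproject back into the query region.
fwd = np.array([3, 4, 5, 0, 1, 2])
bwd = np.array([5, 4, 3, 0, 1, 5])
recall, valid_tokens = cycle_consistent_recall([0, 1, 2], fwd, bwd)

# Toy masks: one large enough to keep, one below the threshold.
large = np.zeros((200, 200), dtype=bool)
large[:151, :151] = True   # 22801 pixels
small = np.zeros((200, 200), dtype=bool)
small[:100, :100] = True   # 10000 pixels
labels = filter_and_label_masks([large, small])
```

In this sketch an entity with low recall would be flagged as a transient candidate; the actual decision threshold is not given in the captions and is left to the caller.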