SWAG: Splatting in the Wild images with Appearance-conditioned Gaussians

Hiba Dahmani, Moussab Bennehar, Nathan Piasco, Luis Roldao, Dzmitry Tsishkou

TL;DR

SWAG tackles in-the-wild 3D scene reconstruction by extending 3D Gaussian Splatting with appearance-conditioned colors and image-dependent opacity variations to handle photometric changes and transient occluders. It introduces an image-conditioned color network that uses per-image embeddings and a center-aware hash encoding, and a Binary Concrete-based opacity mechanism to identify and remove transient objects in an unsupervised manner. Across Phototourism and NeRF-OSR benchmarks, SWAG delivers state-of-the-art rendering quality with significantly faster training and real-time rendering compared to prior in-the-wild methods, while enabling appearance transfer and smooth interpolation in the learned appearance space. Ablation studies validate the respective contributions of appearance modeling and transient handling, and analyses quantify the distribution of transient Gaussians, supporting robust static scene reconstruction with occluder removal in unconstrained photo collections.

Abstract

Implicit neural representation methods have shown impressive advancements in learning 3D scenes from unstructured in-the-wild photo collections but are still limited by the large computational cost of volumetric rendering. More recently, 3D Gaussian Splatting emerged as a much faster alternative with superior rendering quality and training efficiency, especially for small-scale and object-centric scenarios. Nevertheless, this technique performs poorly on unstructured in-the-wild data. To tackle this, we extend 3D Gaussian Splatting to handle unstructured image collections. We achieve this by modeling appearance to capture photometric variations in the rendered images. Additionally, we introduce a new mechanism to train transient Gaussians to handle the presence of scene occluders in an unsupervised manner. Experiments on diverse photo collection scenes and multi-pass acquisition of outdoor landmarks show the effectiveness of our method over prior works, achieving state-of-the-art results with improved efficiency.

Paper Structure (26 sections, 10 equations, 9 figures, 9 tables)

Figures (9)

  • Figure 1: Given in-the-wild captures (a), our model enables transient object removal (b) and scene reconstruction with variable appearances (c).
  • Figure 2: SWAG model architecture -- In addition to the typical Gaussians' features, we also optimize a Hash Grid encoding their centers $\mathbf{emb(x)}$, a per-image embedding vector $l_I$ and an MLP. This MLP takes as inputs the Gaussians' colors $c$, the associated image embedding $l_I$, and their encoded centers $\mathbf{emb(x)}$ and outputs an image-dependent color $\mathbf{c^{I}}$ as well as an image-dependent opacity variation parameter $\Delta \alpha^I$. This parameter is set as the location variable of a concrete distribution which we sample to get the opacity variation $\Delta \mathbf{\Tilde{\alpha}^{I}}$. Leveraging this opacity variation across diverse training images enables identifying and excluding transient Gaussians within the scene, as demonstrated by the grey Gaussians.
  • Figure 3: Variance histogram analysis of the Gaussians' opacity variation $\Delta \mathbf{\Tilde{\alpha}^{I}}$ w.r.t. training images for a dynamic scene (i.e., one containing transient objects; top, Trevi Fountain from Phototourism [Jin_2020]) and a static scene (bottom, Locomotive from Tanks&Temples [Knapitsch2017TanksAT]). The left-column views are rendered using only static Gaussians (i.e., those having $\mathbf{Var} \left [ \Delta\mathbf{\Tilde{\alpha}^{I}} \right ] = 0$), whereas the right-column views are rendered using all Gaussians.
  • Figure 4: Qualitative experimental results on three real-world scenes from Phototourism [Jin_2020].
  • Figure 5: Qualitative experimental results on three real-world scenes from Phototourism [Jin_2020] and four NeRF-OSR [rudnev2022nerfosr] scenes. We demonstrate the capability of SWAG to disentangle the static and transient parts of the scene.
  • ...and 4 more figures
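
The captions of Figures 2 and 3 describe two steps that a small example may make more concrete: sampling an opacity variation from a Concrete distribution whose location is predicted per image, and then treating Gaussians whose sampled variation has zero variance across training images as static. The following is a minimal sketch in plain Python, not the authors' implementation. It assumes the standard Binary Concrete reparameterization (logistic noise added to a log-location, divided by a temperature, passed through a sigmoid). The function names, the `temperature` parameter, the `tol` threshold, and the choice to pass the location in log space are all hypothetical and introduced only for illustration.

```python
import math
import random
import statistics


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sample_binary_concrete(log_location, temperature, rng):
    """Draw one relaxed Bernoulli sample in [0, 1].

    log_location: log of the Concrete location parameter (assumed here to
        come from the MLP output for one Gaussian and one image).
    temperature: relaxation temperature; smaller values push samples
        toward 0 or 1.
    """
    u = rng.random()
    # Clamp away from 0 and 1 so the logs below stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    logistic_noise = math.log(u) - math.log(1.0 - u)
    return sigmoid((log_location + logistic_noise) / temperature)


def split_static_transient(variations_per_gaussian, tol=0.0):
    """Split Gaussian indices by the variance of their opacity variation.

    variations_per_gaussian[i] holds the sampled opacity variation of
    Gaussian i for each training image. Following the Figure 3 caption,
    a Gaussian counts as static when that variance is zero (or <= tol).
    """
    static, transient = [], []
    for idx, values in enumerate(variations_per_gaussian):
        if statistics.pvariance(values) <= tol:
            static.append(idx)
        else:
            transient.append(idx)
    return static, transient


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [sample_binary_concrete(2.0, 0.5, rng) for _ in range(5)]
    print(samples)
    print(split_static_transient([[1.0, 1.0, 1.0], [1.0, 0.0, 1.0]]))
```

In this toy usage, the first Gaussian keeps the same opacity variation in every image and is classified as static, while the second changes in one image and is classified as transient. How the sampled variation is combined with a Gaussian's base opacity is not specified in the text above, so the sketch leaves that step out.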