Wild-GS: Real-Time Novel View Synthesis from Unconstrained Photo Collections

Jiacong Xu, Yiqun Mei, Vishal M. Patel

TL;DR

Wild-GS tackles robust real-time novel view synthesis from unconstrained photo collections by extending 3D Gaussian Splatting with explicit hierarchical appearance modeling. It models global appearance, per-Gaussian local appearance sampled from triplane features, and intrinsic Gaussian properties, while employing depth regularization and transient-object masking to stabilize geometry and color. The approach achieves state-of-the-art rendering quality and superior training/inference efficiency on Phototourism datasets, outperforming NeRF-based methods and prior 3DGS variants. This enables fast, high-fidelity view synthesis and flexible appearance transfer from reference views.

Abstract

Photographs captured in unstructured tourist environments frequently exhibit variable appearances and transient occlusions, which hinder accurate scene reconstruction and induce artifacts in novel view synthesis. Although prior approaches have integrated the Neural Radiance Field (NeRF) with additional learnable modules to handle dynamic appearances and eliminate transient objects, their extensive training demands and slow rendering speeds limit practical deployment. Recently, 3D Gaussian Splatting (3DGS) has emerged as a promising alternative to NeRF, offering superior training and inference efficiency along with better rendering quality. This paper presents Wild-GS, an adaptation of 3DGS optimized for unconstrained photo collections that preserves its efficiency benefits. Wild-GS determines the appearance of each 3D Gaussian from its inherent material attributes, the global illumination and camera properties of each image, and the point-level local variance of reflectance. Unlike previous methods that model reference features in image space, Wild-GS explicitly aligns the pixel appearance features to the corresponding local Gaussians by sampling the triplane extracted from the reference image. This design transfers the high-frequency detailed appearance of the reference view to 3D space and significantly expedites the training process. Furthermore, 2D visibility maps and depth regularization are leveraged to mitigate transient effects and constrain the geometry, respectively. Extensive experiments demonstrate that Wild-GS achieves state-of-the-art rendering performance and the highest efficiency in both training and inference among existing techniques.
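
The three-part appearance decomposition described in the abstract (a per-image global embedding, a per-Gaussian local embedding sampled from a triplane, and a stored intrinsic feature) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the plane keys `"xy"`, `"xz"`, `"yz"`, the bilinear sampling scheme, and the single linear layer standing in for the decoder are all assumptions made for illustration.

```python
import numpy as np

def bilinear_sample(plane, uv):
    """Bilinearly sample a (H, W, C) feature plane at (N, 2) coordinates in [0, 1]."""
    h, w, _ = plane.shape
    x = uv[:, 0] * (w - 1)
    y = uv[:, 1] * (h - 1)
    # Clamp the lower corner so that x0 + 1 and y0 + 1 stay inside the plane.
    x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
    fx = (x - x0)[:, None]
    fy = (y - y0)[:, None]
    top = plane[y0, x0] * (1 - fx) + plane[y0, x0 + 1] * fx
    bottom = plane[y0 + 1, x0] * (1 - fx) + plane[y0 + 1, x0 + 1] * fx
    return top * (1 - fy) + bottom * fy

def local_appearance(planes, xyz_norm):
    """Query a local feature for each Gaussian from three orthogonal planes.

    planes: dict with hypothetical keys "xy", "xz", "yz", each of shape (H, W, C).
    xyz_norm: (N, 3) Gaussian centers, normalized to [0, 1].
    """
    f_xy = bilinear_sample(planes["xy"], xyz_norm[:, [0, 1]])
    f_xz = bilinear_sample(planes["xz"], xyz_norm[:, [0, 2]])
    f_yz = bilinear_sample(planes["yz"], xyz_norm[:, [1, 2]])
    return np.concatenate([f_xy, f_xz, f_yz], axis=1)

def fuse(global_emb, local_emb, intrinsic, weight, bias):
    """Map the three feature groups to per-Gaussian SH coefficients.

    A single linear layer is an assumed stand-in for whatever decoder is used.
    """
    n = local_emb.shape[0]
    g = np.broadcast_to(global_emb, (n, global_emb.shape[0]))  # same global code for all Gaussians
    x = np.concatenate([g, local_emb, intrinsic], axis=1)
    return x @ weight + bias
```

Swapping in the global embedding and triplane of a different reference image while keeping the intrinsic features fixed is what would change the rendered appearance in this sketch, which mirrors the appearance-transfer use described in the TL;DR.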

Paper Structure

This paper contains 26 sections, 13 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Visual comparison between Wild-GS and other existing approaches (chen2022hallucinated; yang2023cross). Wild-GS presents superior computational efficiency (tested on a single RTX 3090), as well as better appearance and geometry reconstruction. Additionally, by modifying the appearance features defined by Wild-GS, one can freely adjust the visual appearance of the entire scene.
  • Figure 2: Overview of the architecture of our proposed Wild-GS. The reference view is first processed by a 2D Parsing module to extract the visibility mask and global appearance embedding. Given the mask and the rendered depth from 3DGS, we back-project the 2D reference image, without transient objects, into 3D space and construct the static 3D point cloud. Then, these 3D points are re-projected onto three predefined orthogonal planes using their normalized coordinates to generate the triplane features. Each 3D Gaussian queries its local appearance embedding by providing its spatial coordinate to the 3D Wrapping module. With the global and local embeddings and the stored intrinsic feature, we can predict the SH coefficients $sh$ of every 3D Gaussian for RGB rasterization.
  • Figure 3: (a) The point cloud from the reference image is projected along three axes and their reverses to generate the triplane color; (b) Illustration of the distribution of the 3D Gaussians on the original triplane and on the cropped one. An axis-aligned bounding box (AABB) is used to perform the 3D cropping.
  • Figure 4: Visual comparison of rendering quality between different approaches. Red and blue crops mainly emphasize appearance and geometry differences, respectively.
  • Figure 5: Rendering results of ablation study on Wild-GS when removing depth regularization, transient mask (left), and global appearance encoding (right). Red rectangles indicate the areas where geometry is missing or color inconsistency happens. Notations follow Table 1.
  • ...and 2 more figures
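
The geometric steps named in the Figure 2 and Figure 3 captions (back-projecting the masked reference image with rendered depth, cropping with an AABB, and projecting the points onto axis-aligned planes) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: a pinhole camera model is assumed, all function and parameter names are hypothetical, and keeping only the nearest point per plane cell is an assumed way to resolve points that land in the same cell.

```python
import numpy as np

def back_project(depth, rgb, mask, K, cam_to_world):
    """Lift static pixels to world-space points.

    depth: (H, W) rendered depth; rgb: (H, W, 3) reference image;
    mask: (H, W) bool, True where the pixel is static (not transient);
    K: (3, 3) pinhole intrinsics (assumed); cam_to_world: (4, 4) pose.
    """
    v, u = np.nonzero(mask)                  # pixel rows and columns of static pixels
    z = depth[v, u]
    x = (u - K[0, 2]) / K[0, 0] * z
    y = (v - K[1, 2]) / K[1, 1] * z
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    return pts_world, rgb[v, u]

def crop_and_normalize(points, colors, aabb_min, aabb_max):
    """Keep points inside the AABB and rescale them to [0, 1] per axis."""
    inside = np.all((points >= aabb_min) & (points <= aabb_max), axis=1)
    norm = (points[inside] - aabb_min) / (aabb_max - aabb_min)
    return norm, colors[inside]

def splat_to_plane(norm_pts, colors, axes, depth_axis, res, reverse=False):
    """Project points onto one res x res plane along depth_axis.

    axes: the two coordinate indices spanning the plane, e.g. [0, 1].
    reverse=True views the points from the opposite side of the same axis.
    """
    plane = np.zeros((res, res, colors.shape[1]))
    if len(norm_pts) == 0:
        return plane
    d = norm_pts[:, depth_axis]
    order = np.argsort(-d if reverse else d, kind="stable")  # nearest point first
    cells = np.minimum((norm_pts[order][:, axes] * res).astype(int), res - 1)
    flat = cells[:, 1] * res + cells[:, 0]
    _, first = np.unique(flat, return_index=True)  # first hit per cell is the nearest
    plane.reshape(-1, colors.shape[1])[flat[first]] = colors[order][first]
    return plane
```

Calling `splat_to_plane` once per axis with `reverse=False` and once with `reverse=True` yields six color planes, which corresponds to projecting "along three axes and their reverses" in the Figure 3 caption. Cropping with the AABB before normalizing means the plane resolution is spent on the region the reference view actually covers rather than on the whole scene.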