
VastGaussian: Vast 3D Gaussians for Large Scene Reconstruction

Jiaqi Lin, Zhihao Li, Xiao Tang, Jianzhuang Liu, Shiyong Liu, Jiayue Liu, Yangdi Lu, Xiaofei Wu, Songcen Xu, Youliang Yan, Wenming Yang

TL;DR

VastGaussian scales 3D Gaussian Splatting to large scenes, where existing NeRF-based methods are limited in visual quality and rendering speed. It divides a large scene into cells that are optimized in parallel, assigning training cameras and points to each cell with an airspace-aware visibility criterion, and it introduces decoupled appearance modeling to suppress the floaters that appearance variations across training images would otherwise cause. The independently optimized cells are merged into a single seamless scene, which renders in real time at 1080p with state-of-the-art quality on multiple large scene datasets. The decoupled appearance module, which uses appearance embeddings and a CNN to learn per-pixel color transformations during training, can be discarded after optimization, so it adds no rendering cost. Together, these components enable fast optimization and high-fidelity rendering of large scenes beyond what prior NeRF-based approaches achieve.

Abstract

Existing NeRF-based methods for large scene reconstruction often have limitations in visual quality and rendering speed. While the recent 3D Gaussian Splatting works well on small-scale and object-centric scenes, scaling it up to large scenes poses challenges due to limited video memory, long optimization time, and noticeable appearance variations. To address these challenges, we present VastGaussian, the first method for high-quality reconstruction and real-time rendering on large scenes based on 3D Gaussian Splatting. We propose a progressive partitioning strategy to divide a large scene into multiple cells, where the training cameras and point cloud are properly distributed with an airspace-aware visibility criterion. These cells are merged into a complete scene after parallel optimization. We also introduce decoupled appearance modeling into the optimization process to reduce appearance variations in the rendered images. Our approach outperforms existing NeRF-based methods and achieves state-of-the-art results on multiple large scene datasets, enabling fast optimization and high-fidelity real-time rendering.
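
The decoupled appearance modeling mentioned in the abstract can be written compactly using the notation of the Figure 4 caption. The following is a sketch under stated assumptions, not the paper's own formulation: $\mathrm{down}$, $\mathrm{concat}$, and $T$ are placeholder names for the downsampling, pixel-wise concatenation, and appearance-adjustment steps (a per-pixel multiplication would be one possible choice for $T$), and the weight $\lambda$ with the $(1-\lambda)$/$\lambda$ split is assumed to follow the usual 3D Gaussian Splatting loss.

```latex
\begin{aligned}
\mathcal{D}_i &= \mathrm{concat}\big(\mathrm{down}(\mathcal{I}^r_i),\ \boldsymbol{\ell}_i\big), \\
\mathcal{M}_i &= \mathrm{CNN}(\mathcal{D}_i), \\
\mathcal{I}^a_i &= T\big(\mathcal{I}^r_i;\ \mathcal{M}_i\big), \\
\mathcal{L} &= (1-\lambda)\,\mathcal{L}_1\big(\mathcal{I}^a_i,\ \mathcal{I}_i\big)
             + \lambda\,\mathcal{L}_\text{D-SSIM}\big(\mathcal{I}^r_i,\ \mathcal{I}_i\big).
\end{aligned}
```

In this form the embedding $\boldsymbol{\ell}_i$ and the CNN influence only the $\mathcal{L}_1$ term, while the D-SSIM term is computed on the unadjusted rendering $\mathcal{I}^r_i$. That separation is consistent with the statement that the appearance module can be discarded after optimization.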

Paper Structure (19 sections, 3 equations, 10 figures, 6 tables)


Figures (10)

  • Figure 1: Renderings of three state-of-the-art methods and our VastGaussian from the Residence scene in the UrbanScene3D dataset [lin2022capturing]. (a, b) Mega-NeRF [turki2022mega] and Switch-NeRF [zhenxing2022switch] produce blurry results with slow rendering speeds. (c) We modify 3D Gaussian Splatting (3DGS) [kerbl20233d] so that it can be optimized for enough iterations on a 32 GB GPU. The rendered image is much sharper, but with a lot of floaters. (d) Our VastGaussian achieves higher quality and much faster rendering than state-of-the-art methods in large scene reconstruction, with much shorter training time.
  • Figure 2: (a) Appearance may vary in adjacent training views. (b) Dark or bright blobs may be created near cameras with training images of different brightnesses. (c) 3D Gaussian Splatting uses these blobs to fit the appearance variations, making the renderings similar to the training images in (a). (d) These blobs appear as floaters in novel views. (e) Our decoupled appearance modeling enables the model to learn constant colors, so the rendered images are more consistent in appearance across different views. (f) Our approach greatly reduces floaters in novel views.
  • Figure 3: Progressive data partitioning. Top row: (a) The whole scene is divided into multiple regions based on the 2D camera positions projected on the ground plane. (b) Parts of the training cameras and point cloud are assigned to a specific region according to its expanded boundaries. (c) More training cameras are selected to reduce floaters, based on an airspace-aware visibility criterion, where a camera is selected if it has sufficient visibility on this region. (d) More points of the point cloud are incorporated for better initialization of 3D Gaussians, if they are observed by the selected cameras. Bottom row: Two visibility definitions to select more training cameras. (e) A naive way: The visibility of the $i$-th camera on the $j$-th cell is defined as $\Omega^\text{surf}_{ij}/\Omega_{i}$, where $\Omega_{i}$ is the area of the image $\mathcal{I}_i$, and $\Omega^\text{surf}_{ij}$ is the convex hull area formed by the surface points in the $j$-th cell that are projected to $\mathcal{I}_i$. (f) Our airspace-aware solution: The convex hull area $\Omega^\text{air}_{ij}$ is calculated on the projection of the $j$-th cell's bounding box in $\mathcal{I}_i$. (g) Floaters caused by depth ambiguity with improper point initialization, which cannot be eliminated without sufficient supervision from training cameras.
  • Figure 4: Decoupled appearance modeling. The rendered image $\mathcal{I}^r_i$ is downsampled to a smaller resolution, concatenated with an optimizable appearance embedding $\boldsymbol{\ell}_i$ in a pixel-wise manner to obtain $\mathcal{D}_i$, and then fed into a CNN to generate a transformation map $\mathcal{M}_i$. $\mathcal{M}_i$ is used to perform appearance adjustment on $\mathcal{I}^r_i$ to get an appearance-variant image $\mathcal{I}^a_i$, which is used to calculate the loss $\mathcal{L}_1$ against the ground truth $\mathcal{I}_i$, while $\mathcal{I}^r_i$ is used to calculate the D-SSIM loss.
  • Figure 5: Qualitative comparison between VastGaussian and previous work. Floaters are pointed out by green arrows.
  • ...and 5 more figures
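
The airspace-aware visibility described in the Figure 3 caption, $\Omega^\text{air}_{ij}/\Omega_{i}$, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. It assumes a pinhole camera, bounding-box corners already expressed in camera coordinates with all eight corners in front of the camera, and a projected hull that is clipped to the image rectangle before its area is measured. All function and parameter names are hypothetical.

```python
from typing import List, Tuple

Point2 = Tuple[float, float]
Point3 = Tuple[float, float, float]


def project(p: Point3, focal: float, cx: float, cy: float) -> Point2:
    """Pinhole projection of a camera-space point (assumes z > 0)."""
    x, y, z = p
    return (focal * x / z + cx, focal * y / z + cy)


def convex_hull(points: List[Point2]) -> List[Point2]:
    """Andrew's monotone chain; returns the hull vertices in order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o: Point2, a: Point2, b: Point2) -> float:
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower: List[Point2] = []
    upper: List[Point2] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def clip_to_image(poly: List[Point2], width: float, height: float) -> List[Point2]:
    """Sutherland-Hodgman clipping of a convex polygon to [0, width] x [0, height]."""

    def clip(src, inside, intersect):
        out = []
        for i, cur in enumerate(src):
            prev = src[i - 1]
            if inside(cur):
                if not inside(prev):
                    out.append(intersect(prev, cur))
                out.append(cur)
            elif inside(prev):
                out.append(intersect(prev, cur))
        return out

    def at_x(x0):
        return lambda a, b: (x0, a[1] + (b[1] - a[1]) * (x0 - a[0]) / (b[0] - a[0]))

    def at_y(y0):
        return lambda a, b: (a[0] + (b[0] - a[0]) * (y0 - a[1]) / (b[1] - a[1]), y0)

    edges = [
        (lambda p: p[0] >= 0.0, at_x(0.0)),
        (lambda p: p[0] <= width, at_x(width)),
        (lambda p: p[1] >= 0.0, at_y(0.0)),
        (lambda p: p[1] <= height, at_y(height)),
    ]
    for inside, intersect in edges:
        if not poly:
            break
        poly = clip(poly, inside, intersect)
    return poly


def polygon_area(poly: List[Point2]) -> float:
    """Shoelace formula; returns 0 for degenerate polygons."""
    if len(poly) < 3:
        return 0.0
    twice_area = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i - 1]
        x2, y2 = poly[i]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def airspace_visibility(box_corners_cam: List[Point3], focal: float,
                        width: float, height: float) -> float:
    """Ratio of the projected bounding-box hull area to the image area."""
    cx, cy = width / 2.0, height / 2.0  # assumed principal point at image center
    projected = [project(p, focal, cx, cy) for p in box_corners_cam]
    hull = clip_to_image(convex_hull(projected), width, height)
    return polygon_area(hull) / (width * height)
```

For example, a box with corners at x, y in {-1, 1} and z in {2, 4}, viewed with `focal=100` in a 400 x 400 image, projects to a 100 x 100 pixel square, giving a visibility of 0.0625. A camera would then be selected for a cell when this ratio exceeds a chosen threshold, which is left unspecified here. Because the ratio depends on the cell's bounding box rather than on its surface points, it stays meaningful even when few surface points of the cell project into the image.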