Generative Gaussian Splatting for Unbounded 3D City Generation

Haozhe Xie, Zhaoxi Chen, Fangzhou Hong, Ziwei Liu

TL;DR

GaussianCity introduces a compact BEV-Point representation and a BEV-Point Decoder to enable unbounded 3D city generation with 3D Gaussian Splatting. By filtering to visible BEV points and decoupling position-related and style-related attributes, it keeps VRAM usage constant while achieving high realism and efficiency, outperforming prior NeRF-based and Gaussian-splatting-based methods, including a roughly 60x speedup over CityDreamer (10.72 FPS vs. 0.18 FPS). The approach is validated on GoogleEarth and KITTI-360, showing state-of-the-art results in drone-view and street-view generation, along with thorough ablations and an analysis of limitations. This work enables scalable, real-time generation of large-scale city scenes with practical implications for gaming, simulation, and VR/AR applications.

Abstract

3D city generation with NeRF-based methods shows promising generation results but is computationally inefficient. Recently, 3D Gaussian Splatting (3D-GS) has emerged as a highly efficient alternative for object-level 3D generation. However, adapting 3D-GS from finite-scale 3D objects and humans to infinite-scale 3D cities is non-trivial. Unbounded 3D city generation entails significant storage overhead (out-of-memory issues), arising from the need to expand points to billions, often demanding hundreds of gigabytes of VRAM for a city scene spanning 10 km$^2$. In this paper, we propose GaussianCity, a generative Gaussian Splatting framework dedicated to efficiently synthesizing unbounded 3D cities with a single feed-forward pass. Our key insights are two-fold: 1) Compact 3D Scene Representation: We introduce BEV-Point as a highly compact intermediate representation, ensuring that VRAM usage remains constant as the scene grows, thus enabling unbounded city generation. 2) Spatial-aware Gaussian Attribute Decoder: We present a spatial-aware BEV-Point decoder to produce 3D Gaussian attributes, which leverages a Point Serializer to integrate the structural and contextual characteristics of BEV points. Extensive experiments demonstrate that GaussianCity achieves state-of-the-art results in both drone-view and street-view 3D city generation. Notably, compared to CityDreamer, GaussianCity exhibits superior performance with a speedup of 60 times (10.72 FPS vs. 0.18 FPS).
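
The two scale figures in the abstract can be sanity-checked with a little arithmetic. The sketch below is not from the paper: it assumes the commonly used 3D-GS parameterization of 59 float32 values per Gaussian (3 for position, 3 for scale, 4 for a rotation quaternion, 1 for opacity, and 48 degree-3 spherical-harmonic color coefficients), and the function names are hypothetical. It counts raw attribute storage only; training would need additional memory for gradients and optimizer state.

```python
# Back-of-envelope check of the abstract's scale claims.
# Assumption (not stated in the source): each Gaussian stores 59 float32
# values, as in the common 3D-GS parameterization.
FLOATS_PER_GAUSSIAN = 3 + 3 + 4 + 1 + 48  # position, scale, rotation, opacity, SH color
BYTES_PER_FLOAT32 = 4


def gaussian_storage_gb(num_points: int) -> float:
    """Raw attribute storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_points * FLOATS_PER_GAUSSIAN * BYTES_PER_FLOAT32 / 1e9


def speedup(fps_new: float, fps_old: float) -> float:
    """Ratio of two frame rates."""
    return fps_new / fps_old


if __name__ == "__main__":
    # One billion Gaussians under the assumed layout.
    print(f"{gaussian_storage_gb(1_000_000_000):.0f} GB per billion Gaussians")
    # Frame rates quoted in the abstract for GaussianCity and CityDreamer.
    print(f"{speedup(10.72, 0.18):.1f}x speedup over CityDreamer")
```

Under these assumptions, one billion Gaussians already occupy a few hundred gigabytes before any training overhead, which is consistent with the "hundreds of gigabytes" figure, and the quoted frame rates give a ratio just under 60.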

Paper Structure (23 sections, 18 equations, 13 figures, 5 tables)

Figures (13)

  • Figure 1: (a) Benefiting from the compact BEV-Point representation, GaussianCity can generate unbounded 3D cities using 3D Gaussian splatting (3D-GS). (b) As the number of points increases, VRAM usage during 3D-GS training rises significantly, whereas BEV-Point, acting as a compact representation, maintains a constant VRAM usage. (c) As the number of points increases, BEV-Point exhibits significantly lower growth in file storage compared to 3D-GS. (d) The proposed GaussianCity achieves not only superior generation quality but also the best efficiency in 3D city generation.
  • Figure 2: The framework of GaussianCity. To create an unbounded 3D city, the BEV points are first generated from a local patch of the BEV maps, which include the height field $\mathbf{H}$, semantic map $\mathbf{S}$, and binary density map $\mathbf{D}$. Then, the BEV-Point attributes $\left\{\mathbf{I}, \mathbf{C}_{\it A}, \mathbf{C}_{\it R}, \mathbf{F}_{\it S}\right\}$ are generated for each point, and the Style Lookup Table $\mathcal{T}(\mathbf{L}) \to \mathbf{Z}_T$ is generated for each instance. Next, the BEV-Point Decoder generates the Gaussian attributes $\mathbf{A}$ from the BEV-Point attributes. Finally, the Gaussian Rasterizer $\mathcal{R}$ produces the rendered image $\mathbf{R}$.
  • Figure 3: Qualitative comparison on GoogleEarth. Note that "Pers.Nature" is short for PersistentNature (Chai et al., CVPR 2023). The visual results of InfiniCity (Lin et al., ICCV 2023) are provided by the authors since the source code is not accessible.
  • Figure 4: Qualitative comparison on KITTI-360. The visual results of UrbanGIRAFFE (Yang et al., ICCV 2023) are provided by the authors since the training code and pretrained model are unavailable.
  • Figure 5: User study on GoogleEarth and KITTI-360. All scores are out of 5, with 5 indicating the best. Note that "Pers.Nature" and "UrbanGIR." denote PersistentNature (Chai et al., CVPR 2023) and UrbanGIRAFFE (Yang et al., ICCV 2023), respectively.
  • ...and 8 more figures
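
The first step in the Figure 2 caption, generating BEV points from a local patch of the height field $\mathbf{H}$, semantic map $\mathbf{S}$, and binary density map $\mathbf{D}$, can be illustrated with a toy sketch. The source does not spell out the sampling rule, so everything below is an assumption made for illustration: heights are integers, each kept pixel is extruded into one point per level from 0 up to its height, and $\mathbf{D}$ acts as a keep mask. The function name `bev_patch_to_points` is hypothetical, and the visibility filtering mentioned in the TL;DR is not modeled.

```python
# Hypothetical sketch: turn a local BEV patch into a list of 3D points.
# Assumptions (not confirmed by the source): integer heights, one point per
# level from z = 0 up to the height value, and D used as a keep mask.
from typing import List, Tuple

Grid = List[List[int]]
Point = Tuple[int, int, int, int]  # (x, y, z, semantic label)


def bev_patch_to_points(height: Grid, semantic: Grid, density: Grid) -> List[Point]:
    """Extrude every kept BEV pixel into a vertical column of labeled points."""
    points: List[Point] = []
    for y, row in enumerate(height):
        for x, h in enumerate(row):
            if not density[y][x]:
                continue  # pixel masked out by the binary density map
            label = semantic[y][x]
            for z in range(h + 1):
                points.append((x, y, z, label))
    return points


if __name__ == "__main__":
    H = [[0, 2], [1, 0]]  # height field
    S = [[1, 2], [2, 1]]  # semantic labels
    D = [[1, 1], [0, 1]]  # binary density map (0 = skip this pixel)
    for point in bev_patch_to_points(H, S, D):
        print(point)
```

The point of the sketch is that the three 2D maps are enough to reconstruct the point set on demand, so only the maps for the current local patch need to be held in memory rather than a full 3D point cloud for the whole city.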