CityGS-X: A Scalable Architecture for Efficient and Geometrically Accurate Large-Scale Scene Reconstruction
Yuanyuan Gao, Hao Li, Jiaqi Chen, Zhengyu Zou, Zhihang Zhong, Dingwen Zhang, Xiao Sun, Junwei Han
TL;DR
CityGS-X introduces a scalable architecture for efficient and geometrically accurate large-scale scene reconstruction, built on a Parallelized Hybrid Hierarchical 3D Representation (PH$^2$-3D) with dynamic LoD voxel allocation and a shared Gaussian Decoder. It replaces the traditional merge-and-partition approach with batch-level multi-task rendering and a three-stage Batch-Level Consistent Progressive Training regime, enabling multi-GPU training and rendering with improved geometry and appearance consistency. Extensive experiments on urban-scale datasets show faster training, larger renderable scales, and finer geometric details, including training on a scene of 5,000+ images in about 5 hours on 4× RTX 4090 GPUs, with strong memory efficiency and viable 4K rendering. The work demonstrates that co-designing a geometry-friendly 3D representation with batch-wise cross-view constraints can push large-scale neural rendering beyond prior memory and accuracy limits, offering a practical path toward real-time, high-fidelity city-scale reconstruction.
Abstract
Despite its significant achievements in large-scale scene reconstruction, 3D Gaussian Splatting still faces substantial challenges, including slow processing, high computational costs, and limited geometric accuracy. These core issues arise from its inherently unstructured design and the absence of efficient parallelization. To overcome these challenges simultaneously, we introduce CityGS-X, a scalable architecture built on a novel parallelized hybrid hierarchical 3D representation (PH$^2$-3D). As an early attempt, CityGS-X abandons the cumbersome merge-and-partition process and instead adopts a newly designed batch-level multi-task rendering process. This architecture enables efficient multi-GPU rendering through dynamic Level-of-Detail voxel allocations, significantly improving scalability and performance. In extensive experiments, CityGS-X consistently outperforms existing methods, with faster training, larger rendering capacity, and more accurate geometric detail in large-scale scenes. Notably, CityGS-X can train and render a scene with 5,000+ images in just 5 hours using only 4× RTX 4090 GPUs, a task on which alternative methods run into Out-Of-Memory (OOM) errors and fail completely. This indicates that CityGS-X handles scenes well beyond the capacity of existing methods.
