CityGS-X: A Scalable Architecture for Efficient and Geometrically Accurate Large-Scale Scene Reconstruction

Yuanyuan Gao, Hao Li, Jiaqi Chen, Zhengyu Zou, Zhihang Zhong, Dingwen Zhang, Xiao Sun, Junwei Han

TL;DR

CityGS-$\mathcal{X}$ introduces a scalable architecture for efficient and geometrically accurate large-scale scene reconstruction, built on a Parallelized Hybrid Hierarchical 3D Representation (PH$^2$-3D) with dynamic LoD voxel allocation and a shared Gaussian Decoder. It replaces the traditional merge-and-partition approach with batch-level multi-task rendering and a three-stage Batch-Level Consistent Progressive Training regime, enabling multi-GPU training and rendering with improved geometry and appearance consistency. Extensive experiments show faster training, larger renderable scales, and finer geometric details across urban-scale datasets, including successful training on 5k+ images in about 5 hours on 4× RTX 4090 GPUs, with strong memory efficiency and viable 4K rendering. The work demonstrates that co-designing a geometry-friendly 3D representation with batch-wise cross-view constraints can push large-scale neural rendering beyond prior memory and accuracy limits, offering a practical path toward real-time, high-fidelity city-scale reconstruction.

Abstract

Despite its significant achievements in large-scale scene reconstruction, 3D Gaussian Splatting still faces substantial challenges, including slow processing, high computational costs, and limited geometric accuracy. These core issues arise from its inherently unstructured design and the absence of efficient parallelization. To overcome these challenges simultaneously, we introduce CityGS-X, a scalable architecture built on a novel parallelized hybrid hierarchical 3D representation (PH$^2$-3D). As an early attempt, CityGS-X abandons the cumbersome merge-and-partition process and instead adopts a newly designed batch-level multi-task rendering process. This architecture enables efficient multi-GPU rendering through dynamic Level-of-Detail voxel allocations, significantly improving scalability and performance. In extensive experiments, CityGS-X consistently outperforms existing methods, with faster training, larger rendering capacity, and more accurate geometric details in large-scale scenes. Notably, CityGS-X can train and render a scene with 5,000+ images in just 5 hours using only 4× RTX 4090 GPUs, a task on which alternative methods run out of memory (OOM) and fail completely. This indicates that CityGS-X scales well beyond the capacity of existing methods.

Paper Structure

This paper contains 14 sections, 10 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: We propose CityGS-$\mathcal{X}$, a scalable architecture for efficient and geometrically accurate large-scale scene reconstruction (left). The middle column compares rendered depth from our method and CityGS-v2 \cite{liu2024citygaussianv2}, demonstrating CityGS-$\mathcal{X}$'s superior geometric representation with smoother object surfaces. The top-right PSNR chart highlights CityGS-$\mathcal{X}$'s superior reconstruction quality across various GPU configurations while significantly reducing time consumption. The bottom-right section highlights the memory efficiency of CityGS-$\mathcal{X}$, which successfully handles high-resolution 4K rendering, while CityGS-v2 encounters out-of-memory issues.
  • Figure 2: Comparison between our parallel architecture and previous methods that reconstruct large-scale scenes using a partition-and-merge strategy. (a) trains on overlapping partitions and suffers from artifacts in the overlap areas during merging, which is also time-consuming. (b) instead employs model distillation to enable distributed training, though the distillation process still adds extra time. Moreover, both (a) and (b) restrict Gaussian rendering to a single GPU, limiting the size of the final merged model. (c) Our approach introduces a novel paradigm with a Parallel Hybrid Hierarchical 3D Representation and parallel batch-level rendering techniques, offering enhanced scalability and efficiency for large-scale scene reconstruction.
  • Figure 3: CityGS-$\mathcal{X}$ is a scalable framework that eliminates the partition-and-merge paradigm by using parallel training and rendering. (a) PH$^2$-3D dynamically allocates LoD voxels across multiple GPUs during training (Sec. \ref{sec:representation}). (b) The distributed multi-task rendering strategy divides the images to be rendered into patches and assigns them to different GPUs, which render RGB/Depth/Normal in parallel (Sec. \ref{sec:rendering}). (c) Building upon this framework, we introduce a novel progressive training strategy that applies multi-view consistency to batch training for appearance quality and geometric accuracy (Sec. \ref{sec:training}).
  • Figure 4: Visualization of enhanced and vanilla pseudo depth. In the enhanced depth, values that are inconsistent across views are filtered out and shown as black regions.
  • Figure 5: Qualitative mesh and texture comparison between CityGS-v2 \cite{liu2024citygaussianv2} and our method on the Residence and Sci-Art scenes \cite{lin2022capturing}.
  • ...and 9 more figures
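
The patch-based distribution described in the Figure 3(b) caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `split_into_patches` and `assign_round_robin`, the 256-pixel patch size, and the round-robin assignment policy are all assumptions chosen for illustration. The captions do not specify how patches are sized or balanced across GPUs.

```python
from typing import List, Tuple

# A patch is (x0, y0, x1, y1) in pixel coordinates, end-exclusive.
Patch = Tuple[int, int, int, int]


def split_into_patches(width: int, height: int, patch: int) -> List[Patch]:
    """Tile a width x height image into patches of at most patch x patch pixels."""
    patches: List[Patch] = []
    for y0 in range(0, height, patch):
        for x0 in range(0, width, patch):
            # Clamp the last row/column so patches never exceed the image.
            patches.append((x0, y0, min(x0 + patch, width), min(y0 + patch, height)))
    return patches


def assign_round_robin(patches: List[Patch], num_gpus: int) -> List[List[Patch]]:
    """Hypothetical policy: hand patches to GPUs in turn (assumption, not from the paper)."""
    buckets: List[List[Patch]] = [[] for _ in range(num_gpus)]
    for i, p in enumerate(patches):
        buckets[i % num_gpus].append(p)
    return buckets


if __name__ == "__main__":
    # A 4K frame split across 4 GPUs, mirroring the 4-GPU setup in the abstract.
    patches = split_into_patches(3840, 2160, 256)
    for gpu, bucket in enumerate(assign_round_robin(patches, 4)):
        print(f"GPU {gpu}: {len(bucket)} patches")
```

Round-robin keeps the patch count per GPU within one of each other, but it ignores how many Gaussians fall into each patch. A real system would likely balance on rendering cost rather than patch count.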