Balanced 3DGS: Gaussian-wise Parallelism Rendering with Fine-Grained Tiling
Hao Gui, Lin Hu, Rui Chen, Mingxiao Huang, Yuxin Yin, Jin Yang, Yong Wu, Chen Liu, Zhongxu Sun, Xueyang Zhang, Kun Zhan
TL;DR
The paper tackles load imbalance in 3D Gaussian Splatting (3DGS) training, caused by uneven workloads across pixels, tiles, and training stages. It introduces Balanced 3DGS, which combines inter-block dynamic workload distribution, Gaussian-wise parallel rendering, and fine-grained tiling with a self-adaptive strategy that selects the best-suited render kernel during training. The forward render kernel runs up to 7.52x faster in isolation, and end-to-end training is about 8.5% faster, with occupancy near theoretical limits. The approach offers a practical route to faster, more balanced 3DGS training on a single GPU and sets the stage for future multi-GPU extensions.
Abstract
3D Gaussian Splatting (3DGS) is attracting growing attention in both academia and industry owing to its superior visual quality and rendering speed. However, training a 3DGS model remains time-intensive, especially in load-imbalanced scenarios, where workload diversity among pixels and Gaussian spheres degrades the performance of the renderCUDA kernel. We introduce Balanced 3DGS, an approach to 3DGS training that combines Gaussian-wise parallel rendering with fine-grained tiling to address these load-imbalance issues. First, we introduce an inter-block dynamic workload distribution technique that dynamically maps workloads to Streaming Multiprocessor (SM) resources within a single GPU and forms the foundation of load balancing. Second, we are the first to propose a Gaussian-wise parallel rendering technique, which significantly reduces workload divergence inside a warp and is a critical component in addressing load imbalance. Building on these two methods, we further propose a fine-grained combined load balancing technique that distributes workload uniformly across all SMs and boosts forward renderCUDA kernel performance by up to 7.52x. In addition, we present a self-adaptive render kernel selection strategy that chooses a kernel during 3DGS training according to the current load-balance situation, which effectively improves training efficiency.
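
To make the idea of dynamic workload distribution concrete, the following is a toy CPU-side simulation in Python, not the paper's CUDA kernel. It compares a static assignment, where each tile is pinned to an SM in advance, with a dynamic one, where whichever SM becomes idle first takes the next tile. The function names, the tile costs, and the cost model (one abstract time unit per unit of work) are assumptions made purely for illustration.

```python
import heapq


def static_makespan(tile_costs, num_sms):
    """Hypothetical static schedule: tile i is pinned to SM (i % num_sms)."""
    loads = [0] * num_sms
    for i, cost in enumerate(tile_costs):
        loads[i % num_sms] += cost
    # The slowest SM determines when the whole kernel finishes.
    return max(loads)


def dynamic_makespan(tile_costs, num_sms):
    """Hypothetical dynamic schedule: the first idle SM pulls the next tile."""
    finish_times = [0] * num_sms  # min-heap of per-SM finish times
    heapq.heapify(finish_times)
    for cost in tile_costs:
        earliest = heapq.heappop(finish_times)  # SM that becomes idle first
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


# Illustrative, made-up costs: two heavy tiles among many light ones.
tile_costs = [100, 1, 1, 1, 100, 1, 1, 1]
print(static_makespan(tile_costs, num_sms=4))   # 200
print(dynamic_makespan(tile_costs, num_sms=4))  # 101
```

In this made-up example the static schedule places both heavy tiles on the same SM, so the kernel takes 200 time units, while the dynamic schedule finishes in 101. The remaining gap to the ideal of about 52 units (total work 206 divided by 4 SMs) comes from the indivisible heavy tiles, which is the kind of imbalance that finer-grained work splitting is meant to reduce.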
