TC-GS: A Faster Gaussian Splatting Module Utilizing Tensor Cores
Zimu Liao, Jifeng Ding, Siwei Cui, Ruixuan Gong, Boni Hu, Yi Wang, Hengjie Li, Xingcheng Zhang, Hui Wang, Rong Fu
TL;DR
This work tackles the slow rendering speed of 3D Gaussian Splatting (3DGS) caused by the costly conditional alpha-blending step. It introduces TC-GS, a hardware-aware, plug-and-play module that maps alpha computation to matrix multiplications so that Tensor Cores can carry the work, making it applicable across 3DGS pipelines. The approach comprises EarlyCull (pruning), Frag2Mat (batched matrix formulation of alpha), and G2L (global-to-local coordinate transformation), which together maintain numerical stability in FP16 while delivering substantial speedups. Empirical results show up to 5.6x total acceleration, with around 2x–4x gains in alpha-blending and preserved image quality across multiple datasets and renderers. The module thus supports real-time or edge-ready neural rendering workflows by using the Tensor Cores already present on the GPU, without altering the core 3DGS models.
Abstract
3D Gaussian Splatting (3DGS) renders pixels by rasterizing Gaussian primitives, where conditional alpha-blending dominates the computational cost of the rendering pipeline. This paper proposes TC-GS, an algorithm-independent universal module that extends the applicability of Tensor Cores (TCUs) to 3DGS, leading to substantial speedups and seamless integration into existing 3DGS optimization frameworks. The key innovation lies in mapping alpha computation to matrix multiplication, fully utilizing TCUs that are otherwise idle in existing 3DGS implementations. TC-GS provides plug-and-play acceleration for existing top-tier acceleration algorithms and integrates seamlessly with rendering pipeline designs, such as Gaussian compression and redundancy elimination algorithms. Additionally, we introduce a global-to-local coordinate transformation to mitigate the rounding errors that arise in the quadratic terms of pixel coordinates under Tensor Core half-precision computation. Extensive experiments demonstrate that our method maintains rendering quality while providing an additional 2.18x speedup over existing Gaussian acceleration algorithms, thereby achieving a total acceleration of up to 5.6x.
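
To make the two central ideas concrete, the following is a minimal CPU sketch, not the authors' implementation. It assumes the standard 3DGS alpha model, alpha = opacity * exp(-0.5 * d^T Sigma^{-1} d) with d the offset from the pixel to the projected Gaussian mean, and expands the exponent into a dot product between a per-Gaussian coefficient row and a per-pixel feature row [x^2, y^2, xy, x, y, 1]. Stacking the rows turns alpha evaluation for a tile into one matrix product. The helper names (`gaussian_row`, `pixel_row`, `matmul_nt`), the 16x16 tile size, the folding of ln(opacity) into the constant term, and the sample Gaussians are all illustrative assumptions. Half precision is emulated with Python's `struct` format `"e"`, with inputs rounded to FP16 and products accumulated in higher precision.

```python
import math
import struct


def to_fp16(v):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", v))[0]


def gaussian_row(a, b, c, mx, my, opacity):
    """Coefficients of ln(alpha) over the features [x^2, y^2, xy, x, y, 1].

    (a, b, c) is the inverse 2D covariance [[a, b], [b, c]]; (mx, my) is the mean.
    """
    const = -0.5 * a * mx * mx - 0.5 * c * my * my - b * mx * my
    return [-0.5 * a, -0.5 * c, -b,
            a * mx + b * my, c * my + b * mx,
            const + math.log(opacity)]


def pixel_row(x, y):
    """Per-pixel feature vector matching gaussian_row."""
    return [x * x, y * y, x * y, x, y, 1.0]


def matmul_nt(G, P, quantize=lambda v: v):
    """Return G @ P^T, rounding every input entry with `quantize` first."""
    Gq = [[quantize(v) for v in row] for row in G]
    Pq = [[quantize(v) for v in row] for row in P]
    return [[sum(g * p for g, p in zip(gr, pr)) for pr in Pq] for gr in Gq]


def direct_log_alpha(a, b, c, mx, my, opacity, x, y):
    """Reference: evaluate ln(alpha) per pixel without any expansion."""
    dx, dy = x - mx, y - my
    return math.log(opacity) - 0.5 * (a * dx * dx + c * dy * dy) - b * dx * dy


def max_abs_diff(A, B):
    return max(abs(u - v) for ra, rb in zip(A, B) for u, v in zip(ra, rb))


# Hypothetical data: two Gaussians (a, b, c, mean_x, mean_y, opacity), one tile.
gaussians = [(0.05, 0.01, 0.08, 205.3, 153.7, 0.8),
             (0.12, -0.02, 0.06, 211.0, 160.5, 0.6)]
ox, oy, TILE = 200.0, 150.0, 16
pixels = [(ox + i, oy + j) for j in range(TILE) for i in range(TILE)]

reference = [[direct_log_alpha(*g, x, y) for (x, y) in pixels] for g in gaussians]

# Global coordinates: features such as x^2 reach tens of thousands.
G_global = [gaussian_row(*g) for g in gaussians]
P_global = [pixel_row(x, y) for (x, y) in pixels]

# Local coordinates: shift means and pixels by the tile origin before expanding.
G_local = [gaussian_row(a, b, c, mx - ox, my - oy, o)
           for (a, b, c, mx, my, o) in gaussians]
P_local = [pixel_row(x - ox, y - oy) for (x, y) in pixels]

exact_err = max_abs_diff(matmul_nt(G_local, P_local), reference)
global_err = max_abs_diff(matmul_nt(G_global, P_global, to_fp16), reference)
local_err = max_abs_diff(matmul_nt(G_local, P_local, to_fp16), reference)

print("full precision, matrix form vs direct:", exact_err)
print("FP16 inputs, global coordinates:      ", global_err)
print("FP16 inputs, local coordinates:       ", local_err)

# alpha itself is recovered with an elementwise exponential of the product.
alpha_local = [[math.exp(v) for v in row]
               for row in matmul_nt(G_local, P_local, to_fp16)]
```

In full precision the matrix form reproduces the direct evaluation, so the reformulation itself does not change the result. With FP16 inputs, the global-coordinate product loses accuracy because the quadratic features and the matching constant term are large, whereas the tile-local version keeps every feature at or below 225, which FP16 represents exactly. On a GPU the same product would be issued to Tensor Cores rather than computed with Python loops.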
