TC-GS: A Faster Gaussian Splatting Module Utilizing Tensor Cores

Zimu Liao, Jifeng Ding, Siwei Cui, Ruixuan Gong, Boni Hu, Yi Wang, Hengjie Li, Xingcheng Zhang, Hui Wang, Rong Fu

TL;DR

This work tackles the slow rendering speed of 3D Gaussian Splatting (3DGS) caused by the costly conditional alpha-blending step. It introduces TC-GS, a hardware-aware, plug-and-play module that maps alpha computation to matrix multiplications to fully exploit Tensor Cores, enabling broad applicability across 3DGS pipelines. The approach comprises EarlyCull (pruning), Frag2Mat (batched matrix formulation of alpha), and G2L (local coordinate transformation) to maintain numerical stability on FP16 while delivering substantial speedups. Empirical results show up to 5.6x total acceleration, with around 2x–4x gains in alpha-blending and preserved image quality across multiple datasets and renderers. The proposed module thus enables real-time or edge-ready neural rendering workflows by leveraging existing GPU tensor cores without altering the core 3DGS models.

Abstract

3D Gaussian Splatting (3DGS) renders pixels by rasterizing Gaussian primitives, where conditional alpha-blending dominates the computational cost in the rendering pipeline. This paper proposes TC-GS, an algorithm-independent universal module that expands the applicability of Tensor Cores (TCUs) to 3DGS, leading to substantial speedups and seamless integration into existing 3DGS optimization frameworks. The key innovation lies in mapping alpha computation to matrix multiplication, fully utilizing otherwise idle TCUs in existing 3DGS implementations. TC-GS provides plug-and-play acceleration for existing top-tier acceleration algorithms and integrates seamlessly with rendering pipeline designs, such as Gaussian compression and redundancy elimination algorithms. Additionally, we introduce a global-to-local coordinate transformation to mitigate rounding errors from quadratic terms of pixel coordinates caused by Tensor Core half-precision computation. Extensive experiments demonstrate that our method maintains rendering quality while providing an additional 2.18x speedup over existing Gaussian acceleration algorithms, thereby achieving a total acceleration of up to 5.6x.
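
The mapping of alpha computation to matrix multiplication can be illustrated with a small sketch. It assumes the standard 2D Gaussian exponent used in 3DGS rasterizers, expands the quadratic form in the pixel coordinates, and collects the result into a 6-term dot product between a per-Gaussian vector and a per-pixel vector. The function names, the vector layout, and the folding of log-opacity into the constant term are illustrative assumptions, not necessarily the exact formulation used by TC-GS. Plain Python lists stand in for the matrix tiles a Tensor Core would consume.

```python
import math

def gaussian_row(mu, conic, opacity):
    """Per-Gaussian coefficients (assumed layout) so that
    log(alpha) = dot(gaussian_row, pixel_col)."""
    mx, my = mu
    a, b, c = conic  # inverse 2D covariance [[a, b], [b, c]]
    const = -0.5 * a * mx * mx - b * mx * my - 0.5 * c * my * my
    return [-0.5 * a, -b, -0.5 * c,
            a * mx + b * my, b * mx + c * my,
            const + math.log(opacity)]

def pixel_col(p):
    """Per-pixel monomials matching the coefficient layout above."""
    x, y = p
    return [x * x, x * y, y * y, x, y, 1.0]

def log_alpha_matrix(gaussians, pixels):
    """(num_gaussians x 6) times (6 x num_pixels): one product yields
    the log-alpha of every Gaussian/pixel pair."""
    rows = [gaussian_row(*g) for g in gaussians]
    cols = [pixel_col(p) for p in pixels]
    return [[sum(r * c for r, c in zip(row, col)) for col in cols]
            for row in rows]

def log_alpha_direct(gaussian, p):
    """Reference: the usual per-fragment evaluation."""
    (mx, my), (a, b, c), opacity = gaussian
    dx, dy = p[0] - mx, p[1] - my
    return math.log(opacity) - 0.5 * (a * dx * dx + c * dy * dy) - b * dx * dy

# Hypothetical test data: (mean, conic, opacity) per Gaussian.
gaussians = [((3.2, 4.1), (0.5, 0.1, 0.4), 0.9),
             ((10.0, 2.5), (0.3, -0.05, 0.6), 0.6)]
pixels = [(0.0, 0.0), (3.0, 4.0), (7.0, 1.0), (15.0, 15.0)]

log_alpha = log_alpha_matrix(gaussians, pixels)
alpha = [[math.exp(v) for v in row] for row in log_alpha]
```

Each entry of `alpha` is what a per-fragment loop would compute one at a time. Batching the pairs as a matrix product is what lets the work run on matrix-multiply hardware.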

Paper Structure

This paper contains 39 sections, 54 equations, 5 figures, 8 tables, 1 algorithm.

Figures (5)

  • Figure 1: a) Three types of fragments on a pixel. b) If a Gaussian covers only a small portion of a tile, a large number of culled fragments are generated.
  • Figure 2: Analysis of 3DGS rendering bottlenecks: The left side shows the time distribution of preprocessing, sorting, and alpha-blending for 3DGS, AdR-Gaussian, and Speedy-Splat, with alpha-blending dominating. The right side details alpha computation, culling, and blending, identifying culled fragments as the primary bottleneck due to redundant alpha computations, while skipped fragments incur no cost.
  • Figure 3: Design of Frag2Mat: the alpha computation is reformulated as matrix multiplication to fully leverage Tensor Cores for accelerating alpha calculation.
  • Figure 4: The quadratic term of pixel coordinates contributes significantly to the rounding error, resulting in blurry images. Local coordinates constrain the value of $\Delta p$ to $[-8,8]^2$, which reduces the upper bound of the rounding error.
  • Figure 5: Ablation study on G2L when applying TC-GS on original 3DGS.
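
The rounding issue described in the captions of Figures 4 and 5 can be reproduced in a few lines. The sketch below is an illustration under stated assumptions, not the paper's implementation. It rounds both operands of a 6-term dot product to half precision with the standard `struct` format `"e"`, accumulates in higher precision, and compares global pixel coordinates against coordinates taken relative to a hypothetical reference point `origin`. The Gaussian parameters, the pixel, and the reference point are made-up values chosen so that the local offsets fall inside $[-8,8]^2$.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def exponent_fp16(mu, conic, p):
    """Gaussian exponent as a dot product with FP16 operands and
    wide accumulation (vector layout is an assumption)."""
    mx, my = mu
    a, b, c = conic  # inverse 2D covariance [[a, b], [b, c]]
    x, y = p
    coeffs = [-0.5 * a, -b, -0.5 * c,
              a * mx + b * my, b * mx + c * my,
              -0.5 * a * mx * mx - b * mx * my - 0.5 * c * my * my]
    feats = [x * x, x * y, y * y, x, y, 1.0]
    return sum(to_fp16(g) * to_fp16(f) for g, f in zip(coeffs, feats))

def exponent_exact(mu, conic, p):
    """Double-precision reference using the offset form."""
    a, b, c = conic
    dx, dy = p[0] - mu[0], p[1] - mu[1]
    return -0.5 * (a * dx * dx + c * dy * dy) - b * dx * dy

# Hypothetical values for illustration only.
mu, conic, pixel = (244.3, 163.6), (0.5, 0.1, 0.4), (243.0, 165.0)
origin = (240.0, 160.0)  # assumed tile reference point

exact = exponent_exact(mu, conic, pixel)
global_err = abs(exponent_fp16(mu, conic, pixel) - exact)

# Shift both the Gaussian mean and the pixel into local coordinates.
local_mu = (mu[0] - origin[0], mu[1] - origin[1])
local_pixel = (pixel[0] - origin[0], pixel[1] - origin[1])
local_err = abs(exponent_fp16(local_mu, conic, local_pixel) - exact)

print(f"global error: {global_err:.4f}, local error: {local_err:.4f}")
```

With global coordinates, the quadratic monomial $243^2 = 59049$ falls in a range where adjacent FP16 values are 32 apart, so it cannot be stored exactly. The local offset squared, $3^2 = 9$, is exact. The same effect applies to the large constant and linear coefficients, which is why the shifted form has the smaller error in this example.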