
GEMM-GS: Accelerating 3D Gaussian Splatting on Tensor Cores with GEMM-Compatible Blending

Haomin Li, Bowen Zhu, Fangxin Liu, Zongwu Wang, Xinran Liang, Li Jiang, Haibing Guan

Abstract

Neural Radiance Fields (NeRF) enable 3D scene reconstruction from a handful of 2D images but incur high rendering latency due to their point-sampling design. 3D Gaussian Splatting (3DGS) improves on NeRF with an explicit scene representation and an optimized pipeline, yet still fails to meet practical real-time demands. Existing acceleration works overlook the evolving Tensor Cores of modern GPUs because the 3DGS pipeline lacks General Matrix Multiplication (GEMM) operations. This paper proposes GEMM-GS, an acceleration approach that exploits GPU Tensor Cores via a GEMM-friendly blending transformation: it equivalently reformulates the 3DGS blending process into a GEMM-compatible form. A high-performance CUDA kernel is designed around a three-stage double-buffered pipeline that overlaps computation and memory access. Extensive experiments show that GEMM-GS achieves a $1.42\times$ speedup over vanilla 3DGS and provides an additional $1.47\times$ speedup on average when combined with existing acceleration approaches. Code is released at https://github.com/shieldforever/GEMM-GS.
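To give intuition for the "GEMM-compatible blending" idea, the sketch below shows how standard front-to-back alpha compositing, $C_p = \sum_i \alpha_{p,i} \prod_{j<i}(1-\alpha_{p,j})\,\mathbf{c}_i$, can be recast as a matrix product: precompute a per-pixel weight matrix $W$ of shape (pixels $\times$ Gaussians), then blend all pixels of a tile at once as $C = W \cdot \mathrm{Colors}$, which is exactly the kind of dense GEMM that Tensor Cores accelerate. This is a minimal NumPy illustration of the algebraic equivalence, not the paper's CUDA kernel; all function names are ours.

```python
import numpy as np

def blending_weights(alphas):
    # alphas: (P, N) per-pixel opacity of each depth-sorted Gaussian.
    # Weight of Gaussian i at pixel p: alpha_i * prod_{j<i} (1 - alpha_j),
    # i.e. opacity attenuated by the accumulated transmittance in front of it.
    trans = np.cumprod(1.0 - alphas, axis=1)
    # Shift right so column i holds the transmittance *before* Gaussian i.
    trans = np.concatenate([np.ones((alphas.shape[0], 1)), trans[:, :-1]], axis=1)
    return alphas * trans

def blend_sequential(alphas, colors):
    # Reference: the usual per-pixel sequential front-to-back compositing loop.
    P, N = alphas.shape
    out = np.zeros((P, 3))
    for p in range(P):
        T = 1.0
        for i in range(N):
            out[p] += T * alphas[p, i] * colors[i]
            T *= 1.0 - alphas[p, i]
    return out

def blend_gemm(alphas, colors):
    # GEMM form: (P x N) weight matrix times (N x 3) color matrix.
    return blending_weights(alphas) @ colors
```

Because the two formulations are algebraically identical, `blend_gemm` reproduces `blend_sequential` to floating-point precision while exposing the blending as a single dense matrix multiplication.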


Paper Structure

This paper contains 15 sections, 8 equations, 7 figures, 2 tables, and 2 algorithms.

Figures (7)

  • Figure 1: Computing Power Breakdown of modern GPUs utilized by 3D Gaussian Splatting. Data is collected from the data sheets of the V100, A100, H100, H200, and B200 GPU products.
  • Figure 2: Neural Rendering and Process of 3DGS kerbl20233d. (a) Neural rendering (process of novel view synthesis). 3DGS consists of four stages. (b) Stage 1: Preprocessing. Gaussians are projected onto the rendered image and an intersection test is performed to relate projected Gaussians to tiles. Gaussians' features, including depth $d$ and color $\mathbf{c}$, are also computed. (c) Stage 2: Duplication. Each Gaussian is duplicated according to the number of tiles it intersects. (d) Stage 3: Sorting. Gaussians in each tile are sorted by depth $d$. (e) Stage 4: Blending. The pixels in one tile are rendered in parallel by volume rendering over the same sorted Gaussian list.
  • Figure 3: Rendering Latency Breakdown of 3DGS. The scenes for rendering are from three datasets: Tank&Temples knapitsch2017tanks, Deep Blending hedman2018deep, and Mip-NeRF 360 barron2022mip.
  • Figure 4: High-Performance GPU Kernel Design and Implementation. (a) Dataflow of 3-stage pipeline configured with double buffer. (b) Execution timing of loop iterations and detailed operations in a single iteration.
  • Figure 5: Average image rendering latency (ms) comparison between GEMM-GS and baseline methods on an H100 GPU.
  • ...and 2 more figures