Acc-SpMM: Accelerating General-purpose Sparse Matrix-Matrix Multiplication with GPU Tensor Cores

Haisha Zhao, San Li, Jiaheng Wang, Chunbao Zhou, Jue Wang, Zhikuang Xin, Shunde Li, Zhiqiang Liang, Zhijie Pan, Fang Liu, Yan Zeng, Yangang Wang, Xuebin Chi

TL;DR

Acc-SpMM accelerates general-purpose SpMM on Tensor Cores (TCs) with a four-part optimization stack: data-affinity-based reordering to densify TC blocks, the BitTCF compressed format to reduce storage and memory bandwidth, a high-throughput pipeline that overlaps data movement with MMA, and adaptive sparsity-aware load balancing that distributes work evenly across thread blocks (TBs). Together these yield substantial speedups over cuSPARSE on multiple GPUs (on average 2.52× on RTX 4090, 1.91× on A800, and 1.58× on H100) and on matrices with a higher average number of nonzeros per row, reflecting improved data density, cache utilization, and hardware throughput. Experiments on a set of 10 real-world matrices and on 414 SuiteSparse matrices show gains from each component, and ablations confirm that BitTCF, data reordering, the pipeline, and load balancing each contribute. The work targets GNNs and large-scale sparse computations; planned future work includes column-row reordering and end-to-end integration into frameworks such as DGL.

Abstract

General-purpose Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental kernel in scientific computing and deep learning. The emergence of new matrix computation units such as Tensor Cores (TCs) brings more opportunities for SpMM acceleration. However, fully exploiting the hardware requires systematic optimization. In this paper, we propose Acc-SpMM, a high-performance SpMM library on TCs, with multiple optimizations, including data-affinity-based reordering, a memory-efficient compressed format, a high-throughput pipeline, and adaptive sparsity-aware load balancing. Compared with state-of-the-art SpMM kernels on various NVIDIA GPU architectures across a diverse range of benchmark matrices, Acc-SpMM achieves significant performance improvements over cuSPARSE: on average 2.52× (up to 5.11×) speedup on RTX 4090, 1.91× (up to 4.68×) on A800, and 1.58× (up to 3.60×) on H100.
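
As a point of reference for what an SpMM kernel computes, here is a minimal CPU sketch of C = A × B with a sparse A and a dense B. It assumes A is stored in CSR form (the abstract does not name the input format), and the function name spmm_csr and the small example matrices are hypothetical. It is not Acc-SpMM's Tensor Core implementation, only the baseline arithmetic that such a kernel must reproduce.

```python
def spmm_csr(indptr, indices, data, B):
    """Return C = A @ B, where A is sparse in CSR form and B is a dense list of rows.

    indptr[i]:indptr[i+1] delimits the nonzeros of row i of A;
    indices[k] is the column of the k-th nonzero and data[k] its value.
    """
    n_rows = len(indptr) - 1
    n_cols = len(B[0]) if B else 0
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        row_c = C[i]
        # Each nonzero A[i, j] adds a scaled copy of row j of B to row i of C.
        for k in range(indptr[i], indptr[i + 1]):
            a = data[k]
            row_b = B[indices[k]]
            for c in range(n_cols):
                row_c[c] += a * row_b[c]
    return C


# Hypothetical 3x3 example:
# A = [[1, 0, 2],
#      [0, 0, 0],
#      [0, 3, 0]]
indptr = [0, 2, 2, 3]
indices = [0, 2, 1]
data = [1.0, 2.0, 3.0]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
C = spmm_csr(indptr, indices, data, B)
```

Each nonzero of A touches one full row of B, so rows with many nonzeros do more work than sparse ones. This per-row imbalance and the irregular access to B are the costs that the reordering, compression, pipelining, and load-balancing steps are aimed at.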

Paper Structure

This paper contains 21 sections, 3 equations, 15 figures, 3 tables, 2 algorithms.

Figures (15)

  • Figure 1: The overview of Acc-SpMM
  • Figure 2: The design of data-affinity-based reordering. (a) is the original graph, (b) is dendrogram construction and node remapping, (c) is the reordered graph, (d) is the adjacency matrix of the original graph (a), and (e) is the adjacency matrix of the reordered graph (c).
  • Figure 3: The design of BitTCF format.
  • Figure 4: The data movement of Acc-SpMM. The TC block is moved to shared memory for reuse. The dense $B$ tile is moved directly to registers. The dense matrix $C$ is moved from registers to global memory.
  • Figure 5: Comparison of our proposed pipeline (b) with the DTC-pipeline (a). GToReg: global memory to register; GToSHM: global memory to shared memory; TCMMA: Tensor Core MMA.
  • ...and 10 more figures