Acc-SpMM: Accelerating General-purpose Sparse Matrix-Matrix Multiplication with GPU Tensor Cores
Haisha Zhao, San Li, Jiaheng Wang, Chunbao Zhou, Jue Wang, Zhikuang Xin, Shunde Li, Zhiqiang Liang, Zhijie Pan, Fang Liu, Yan Zeng, Yangang Wang, Xuebin Chi
TL;DR
Acc-SpMM accelerates general-purpose SpMM on Tensor Cores (TCs) with a four-part optimization stack: data-affinity-based reordering to densify TC blocks, BitTCF memory compression to reduce storage and bandwidth, a high-throughput pipeline that overlaps data movement with matrix multiply-accumulate (MMA) operations, and adaptive sparsity-aware load balancing to distribute work evenly across thread blocks (TBs). Together these yield average speedups over cuSPARSE of 2.52× on RTX 4090, 1.91× on A800, and 1.58× on H100, with gains also reported for matrices with a higher average number of nonzeros per row, reflecting improved data density, cache utilization, and hardware throughput. Experiments on 10 real-world matrices and on 414 SuiteSparse matrices show gains from each component, and ablations confirm that BitTCF, data reordering, the pipeline, and load balancing are each necessary. The work targets GNNs and large-scale sparse computations; future work includes column-row reordering and end-to-end integration into frameworks such as DGL to enable broader adoption.
Abstract
General-purpose Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental kernel in scientific computing and deep learning. The emergence of new matrix computation units such as Tensor Cores (TCs) brings more opportunities for SpMM acceleration. However, fully exploiting this hardware requires systematic optimization. In this paper, we propose Acc-SpMM, a high-performance SpMM library on TCs with multiple optimizations: data-affinity-based reordering, a memory-efficient compressed format, a high-throughput pipeline, and adaptive sparsity-aware load balancing. Compared with state-of-the-art SpMM kernels on various NVIDIA GPU architectures and a diverse range of benchmark matrices, Acc-SpMM achieves significant performance improvements over cuSPARSE: on average 2.52x (up to 5.11x) speedup on RTX 4090, on average 1.91x (up to 4.68x) speedup on A800, and on average 1.58x (up to 3.60x) speedup on H100.
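
For readers new to the kernel, the operation being accelerated can be sketched as a plain reference implementation. This is only an illustration of what SpMM computes (a sparse matrix A times a dense matrix B), assuming A is stored in the standard CSR layout; it is not the Acc-SpMM algorithm, and the function name `spmm_csr` and the sample data are hypothetical.

```python
def spmm_csr(row_ptr, col_idx, values, B):
    """Reference SpMM: C = A @ B, with sparse A in CSR form and dense B as a list of rows.

    Illustrative CPU baseline only; no Tensor Core blocking, reordering,
    compression, pipelining, or load balancing is modeled here.
    """
    n_rows = len(row_ptr) - 1
    n_cols_b = len(B[0]) if B else 0
    C = [[0.0] * n_cols_b for _ in range(n_rows)]
    for i in range(n_rows):
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i + 1] - 1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a = values[k]
            b_row = B[col_idx[k]]
            for j in range(n_cols_b):
                C[i][j] += a * b_row[j]
    return C


# Hypothetical 3x3 sparse matrix:
# A = [[1, 0, 2],
#      [0, 0, 0],
#      [0, 3, 0]]
row_ptr = [0, 2, 2, 3]
col_idx = [0, 2, 1]
values = [1.0, 2.0, 3.0]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

C = spmm_csr(row_ptr, col_idx, values, B)
print(C)
```

Rows in this sample have 2, 0, and 1 nonzeros. That uneven per-row work, together with the scattered accesses to rows of B, is the kind of irregularity that the reordering and load-balancing optimizations named in the abstract are meant to address.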
