FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

Jinliang Shi, Shigang Li, Youxuan Xu, Rongtian Fu, Xueying Wang, Tong Wu

TL;DR

FlashSparse accelerates sparse matrix operations on Tensor Core Units (TCUs) by shrinking the nonzero-vector granularity to $8\times1$ through a swap-and-transpose MMA strategy, which keeps each MMA fully utilized while reducing redundant computation and data access. It also introduces the memory-efficient ME-BCRS storage format and a thread mapping scheme for coalesced access to further cut memory traffic. On NVIDIA H100 and RTX 4090 GPUs, it reports state-of-the-art SpMM and SDDMM performance, with geometric-mean speedups of $5.5\times$ over DTC-SpMM and $3.22\times$ over RoDe, and notable end-to-end gains for GNNs. The approach targets unstructured sparse data and is intended to carry over to a broad range of TCU architectures.

Abstract

Sparse Matrix-matrix Multiplication (SpMM) and Sampled Dense-dense Matrix Multiplication (SDDMM) are important sparse operators in scientific computing and deep learning. Tensor Core Units (TCUs) give modern accelerators superior computing power, which promises to raise the performance of matrix operators to a higher level. However, due to the irregularity of unstructured sparse data, it is difficult to deliver practical speedups on TCUs. To this end, we propose FlashSparse, a novel approach to bridge the gap between sparse workloads and the TCU architecture. Specifically, FlashSparse minimizes the sparse granularity for SpMM and SDDMM on TCUs through a novel swap-and-transpose matrix multiplication strategy. With this minimum sparse granularity, computation redundancy is substantially reduced while the computing power of TCUs is fully utilized. In addition, FlashSparse is equipped with a memory-efficient thread mapping strategy for coalesced data access and a sparse matrix storage format that reduces the memory footprint. Extensive experimental results on H100 and RTX 4090 GPUs show that FlashSparse sets a new state-of-the-art for sparse matrix multiplications (geometric mean 5.5x speedup over DTC-SpMM and 3.22x speedup over RoDe).
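
The swap-and-transpose idea can be illustrated with the identity $(AB)^T = B^T A^T$. The sketch below is only an illustration under stated assumptions, not the authors' kernel: it assumes the m16n8k8 MMA shape mentioned in the Figure 2 caption, uses made-up tile contents, and runs on plain Python lists rather than on Tensor Cores. The helper names `matmul` and `transpose` are hypothetical. With the operands swapped, the sparse tile sits on the side of the MMA whose dimension is 8, so its nonzero vectors only need to span 8 rows instead of 16.

```python
def matmul(X, Y):
    """Plain dense matrix product on nested lists (stand-in for one MMA)."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    """Return the transpose of a nested-list matrix."""
    return [list(row) for row in zip(*X)]

# Hypothetical 8x8 sparse tile of A: only three nonzeros.
A = [[0] * 8 for _ in range(8)]
A[0][1], A[3][1], A[7][6] = 2, 5, -1

# Hypothetical 8x16 dense tile of B with arbitrary small integers.
B = [[(r * 16 + c) % 7 for c in range(16)] for r in range(8)]

# Direct form: C = A x B has shape 8x16.
direct = matmul(A, B)

# Swapped form: C^T = B^T x A^T.
# B^T is 16x8 (m=16, k=8) and A^T is 8x8 (k=8, n=8), which fits m16n8k8,
# so the sparse tile A only has to be 8 rows tall.
swapped = matmul(transpose(B), transpose(A))

assert transpose(swapped) == direct
```

Both forms give the same result; only the operand roles differ, which is what lets the sparse side use the smaller dimension.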

Paper Structure

This paper contains 19 sections, 2 equations, 16 figures, 8 tables, 1 algorithm.

Figures (16)

  • Figure 1: The number of MMA invocations under 16$\times$1 and 8$\times$1 nonzero vector sizes in SpMM. Note that counts for IGB-large are shown in units of ten million for clarity.
  • Figure 2: SpMM on TCUs with nonzero vector size of 16$\times$1. The operand shape of MMA is m16n8k8.
  • Figure 3: Overview of FlashSparse.
  • Figure 4: The swap-and-transpose MMA computation.
  • Figure 5: The implementation of SpMM with the swap-and-transpose MMA computation strategy.
  • ...and 11 more figures
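
To make the comparison in the Figure 1 caption concrete, here is a rough sketch of how MMA invocations might be counted for a given nonzero vector height. It is a simplified model, not the paper's measurement method. The function `count_mma` and its parameters are hypothetical, and the model assumes that rows are grouped into windows of `vec_rows` rows, that every column with at least one nonzero in a window forms one nonzero vector, that `k = 8` vectors are packed per MMA, and that each MMA covers `out_cols` columns of the dense operand (8 for the 16$\times$1 layout, 16 for the 8$\times$1 layout with swapped operands).

```python
from math import ceil

def count_mma(coords, n_dense_cols, vec_rows, out_cols, k=8):
    """Estimate MMA calls for SpMM under a simplified window model.

    coords       -- iterable of (row, col) positions of nonzeros
    n_dense_cols -- number of columns in the dense operand
    vec_rows     -- nonzero vector height (16 or 8)
    out_cols     -- dense columns covered by one MMA (assumption)
    k            -- nonzero vectors packed into one MMA (assumption)
    """
    windows = {}
    for r, c in coords:
        # Columns that hold at least one nonzero in this row window.
        windows.setdefault(r // vec_rows, set()).add(c)
    blocks = sum(ceil(len(cols) / k) for cols in windows.values())
    return blocks * ceil(n_dense_cols / out_cols)

# Toy pattern: rows 0-5 touch columns 0-5, rows 8-13 touch columns 6-11.
coords = [(r, r) for r in range(6)] + [(8 + i, 6 + i) for i in range(6)]

mma_16x1 = count_mma(coords, n_dense_cols=16, vec_rows=16, out_cols=8)
mma_8x1 = count_mma(coords, n_dense_cols=16, vec_rows=8, out_cols=16)
print(mma_16x1, mma_8x1)  # prints "4 2"
```

In this toy pattern the taller 16$\times$1 vectors are half padding, so twice as many MMAs are issued for the same nonzeros. Real matrices will show different ratios.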