
RSH-SpMM: A Row-Structured Hybrid Kernel for Sparse Matrix-Matrix Multiplication on GPUs

Aiying Li, Jingwei Sun, Han Li, Wence Ji, Guangzhong Sun

TL;DR

RSH-SpMM is a fine-grained row-structured hybrid SpMM framework designed to better align irregular sparsity with modern GPU execution pipelines. It employs a load-balanced hybrid kernel with locality-aware reordering to enhance structural coherence and sustain high execution efficiency under highly irregular sparsity.

Abstract

Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental computation in graph analytics, scientific simulation, and sparse deep learning workloads. However, the extreme irregularity of real-world sparse matrices prevents existing GPU-based methods from maintaining high Tensor Core utilization and stable throughput. We present RSH-SpMM, a fine-grained row-structured hybrid SpMM framework designed to better align irregular sparsity with modern GPU execution pipelines. RSH-SpMM introduces adaptive row partitioning and employs the RS-Tile representation to expose Tensor-Core-efficient dense fragments, while processing irregular rows on a minimal-overhead CUDA execution path. It further employs a load-balanced hybrid kernel with locality-aware reordering to enhance structural coherence and sustain high execution efficiency under highly irregular sparsity. Comprehensive evaluations across diverse sparse workloads demonstrate that RSH-SpMM consistently outperforms state-of-the-art SpMM designs, yielding speedups of 1.27x to 6.13x and maintaining robust performance across matrices with highly irregular sparsity structures.

Paper Structure

This paper contains 23 sections, 4 equations, 17 figures, 1 table, 3 algorithms.

Figures (17)

  • Figure 1: Comparison of GPU SpMM execution paradigms.
  • Figure 2: Structural characteristics of matrices in the SuiteSparse collection. (a) Distribution of row counts grouped by the number of nonzeros (nnz). The long-row ratios ① and ② correspond to rows with $\mathrm{nnz} > 2\times \mathrm{nnz}_{\text{mean}}$ and $\mathrm{nnz} > 4\times \mathrm{nnz}_{\text{mean}}$, respectively.
  • Figure 3: Overview of RSH-SpMM.
  • Figure 4: RS-Tile compressed format of sparse matrix A. A $4\times 4$ tile is shown for ease of illustration, whereas our implementation employs the native $8\times 8$ MMA footprint on Tensor Cores.
  • Figure 5: Dataflow of the Tensor Core computation.
  • ...and 12 more figures
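
The long-row thresholds in the Figure 2 caption (rows with nnz above 2x and 4x the mean nnz per row) can be illustrated with a small sketch. It is not taken from the paper: the function name `long_row_ratios`, the CSR row-pointer input, and the reading of "ratio" as the fraction of rows exceeding each threshold are all assumptions made for illustration.

```python
def long_row_ratios(row_ptr, factors=(2.0, 4.0)):
    """Fraction of rows whose nnz exceeds f * mean nnz, for each factor f.

    Assumption: the matrix is given as a CSR row-pointer array, and the
    "long-row ratio" means the share of rows above the threshold.
    """
    # Per-row nonzero counts derived from consecutive row-pointer entries.
    row_nnz = [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]
    n_rows = len(row_nnz)
    if n_rows == 0:
        return [0.0 for _ in factors]
    mean_nnz = sum(row_nnz) / n_rows
    return [
        sum(1 for count in row_nnz if count > f * mean_nnz) / n_rows
        for f in factors
    ]


# Toy matrix with 8 rows and nnz per row = [1, 1, 1, 1, 1, 1, 8, 10] (mean 3).
row_ptr = [0, 1, 2, 3, 4, 5, 6, 14, 24]
ratios = long_row_ratios(row_ptr)
print(ratios)
```

In this toy case two of the eight rows exceed twice the mean and none exceed four times the mean, which is the kind of skew the caption's two ratios are meant to capture.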