cuTeSpMM: Accelerating Sparse-Dense Matrix Multiplication using GPU Tensor Cores

Lizhi Xiang, Omid Asudeh, Gerald Sabin, Aravind Sukumaran-Rajam, P. Sadayappan

TL;DR

The paper tackles the challenge of accelerating sparse–dense matrix multiplication (SpMM) on GPUs using dense Tensor Cores, which historically underperform on highly sparse patterns. It introduces a hierarchical, brick-based HRPB representation and a cuTeSpMM kernel that maps SpMM to Tensor Core MMA tiles, guided by a synergy metric (TCU-Synergy) and careful load balancing. The main contributions are the HRPB data structure, the high-performance cuTeSpMM kernel, a synergy-based analysis linking data layout to throughput, and extensive experiments across 1000+ SuiteSparse matrices showing significant gains over TC-GNN and scalar-core SpMM for matrices with high synergy. The work demonstrates that structured sparsity can be effectively exploited to unlock substantial SpMM speedups on modern GPUs, broadening the practical impact of Tensor Core-powered sparse computations in scientific computing and data analytics.

Abstract

Many recent GPUs feature matrix multiplication engines (aka Tensor Core Units or TCUs) that perform small fixed-size matrix-matrix products at very high throughput. They have been used very effectively to speed up dense matrix-matrix multiplication libraries like Nvidia's cuBLAS, enabling significantly higher performance than the traditional scalar GPU cores. There has also been recent interest in using these dense TCUs for the important sparse-dense matrix-matrix multiplication (SpMM) kernel via explicit zero-filling. However, an examination of the attainable performance of TC-GNN, the state-of-the-art TCU-enhanced SpMM implementation, indicates that for a substantial majority of the sparse matrices in the SuiteSparse collection, the achieved performance falls significantly short of the state-of-the-art SpMM kernels that only utilize scalar cores. In this paper, we therefore address the question: Can dense TCUs be effectively used to accelerate SpMM for a range of sparse matrices arising from multiple application domains, such as those found in the SuiteSparse matrix collection? We answer this question in the affirmative by developing a very efficient TCU-based GPU kernel, cuTeSpMM (cuda Tensor core SpMM), that achieves substantially higher performance than TC-GNN. We also develop a notion of the TCU-Synergy of a sparse matrix, based on its non-zero structure and a modeled Operational Intensity. For sparse matrices with high TCU-Synergy, cuTeSpMM outperforms state-of-the-art scalar-core SpMM implementations, while achieving only slightly lower performance on matrices with low TCU-Synergy.
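
To make "explicit zero-filling" concrete, the following is a minimal CPU sketch in plain Python, not the paper's kernel or its HRPB format. It compares a scalar CSR SpMM with a version that expands every non-empty tile of the sparse matrix into a small dense block (padding with zeros) and multiplies it densely, which is the kind of work a TCU would perform. The function names, the 2x2 tile size, and the `tile_density` value are assumptions made for illustration; in particular, `tile_density` is only a rough proxy for how much TCU work is useful and is not the paper's TCU-Synergy definition.

```python
def spmm_csr(indptr, indices, data, B):
    """Reference scalar SpMM: C = A @ B, with A in CSR and B a dense list of rows."""
    n_rows, n = len(indptr) - 1, len(B[0])
    C = [[0.0] * n for _ in range(n_rows)]
    for i in range(n_rows):
        for p in range(indptr[i], indptr[i + 1]):
            k, a = indices[p], data[p]
            for j in range(n):
                C[i][j] += a * B[k][j]
    return C


def spmm_zero_filled_tiles(indptr, indices, data, B, tile_m=2, tile_k=2):
    """SpMM that zero-fills each non-empty (tile_m x tile_k) tile of A and
    multiplies it densely. Returns (C, tile_density), where tile_density is
    nonzeros / dense slots touched (hypothetical illustrative metric)."""
    n_rows, n_inner, n = len(indptr) - 1, len(B), len(B[0])
    C = [[0.0] * n for _ in range(n_rows)]
    nnz = dense_slots = 0
    for r0 in range(0, n_rows, tile_m):
        # Group the nonzeros of this row panel by column tile, padding with zeros.
        tiles = {}
        for i in range(r0, min(r0 + tile_m, n_rows)):
            for p in range(indptr[i], indptr[i + 1]):
                k = indices[p]
                tile = tiles.setdefault(
                    k // tile_k, [[0.0] * tile_k for _ in range(tile_m)])
                tile[i - r0][k % tile_k] += data[p]
                nnz += 1
        # Dense multiply of every non-empty tile, zeros included.
        for t, tile in tiles.items():
            dense_slots += tile_m * tile_k
            for di in range(min(tile_m, n_rows - r0)):
                for dk in range(min(tile_k, n_inner - t * tile_k)):
                    a = tile[di][dk]
                    for j in range(n):
                        C[r0 + di][j] += a * B[t * tile_k + dk][j]
    return C, (nnz / dense_slots if dense_slots else 0.0)


# Example: a 4x4 sparse matrix with 5 nonzeros, stored in CSR.
indptr = [0, 2, 3, 4, 5]
indices = [0, 1, 1, 3, 2]
data = [1.0, 2.0, 3.0, 4.0, 5.0]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]

C_ref = spmm_csr(indptr, indices, data, B)
C_tiled, tile_density = spmm_zero_filled_tiles(indptr, indices, data, B)
```

Both functions produce the same result matrix. In this toy input the nonzeros fall into two 2x2 tiles, so the dense path touches 8 slots for 5 nonzeros; the more scattered the nonzeros, the more of the dense work is spent on padding zeros.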

Paper Structure

This paper contains 19 sections, 7 equations, 12 figures, 2 tables, 1 algorithm.

Figures (12)

  • Figure 2: TC-GNN performance versus the best SC performance (cuSparse, GeSpMM, Sputnik). Left: Ampere A100; right: RTX 4090
  • Figure 3: Sparse Compression Process
  • Figure 4: Block Data Structure
  • Figure 5: HRPB data structure
  • Figure 6: Data Transfer from Dense Matrix into the Shared Memory
  • ...and 7 more figures