High Performance Unstructured SpMM Computation Using Tensor Cores

Patrik Okanovic; Grzegorz Kwasniewski; Paolo Sylos Labini; Maciej Besta; Flavio Vella; Torsten Hoefler

TL;DR

SMaT accelerates SpMM on unstructured sparse matrices by converting CSR inputs to a blocked CSR (BCSR) format and executing a Tensor Core–aware kernel with bottom-up 2D parallelism. It combines a row-wise permutation preprocessing step, which densifies blocks, with a highly optimized CUDA implementation that uses the MMA API and asynchronous data transfers. The approach yields speedups over cuSPARSE of up to 125x, and of up to 2,445x on synthetic matrices, with strong improvements across real-world SuiteSparse matrices as well, particularly at higher sparsity and larger dense matrix widths. The results demonstrate that hardware-aware blocking and careful data movement can unlock Tensor Core performance for general SpMM, broadening its applicability to scientific computing, large-model training, and inference tasks.
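To make the CSR-to-BCSR step concrete, here is a minimal pure-Python sketch of the data layout, not SMaT's CUDA implementation. The function names `csr_to_bcsr` and `bcsr_spmm`, the block dimensions `bh` and `bw`, and the list-of-lists storage are assumptions chosen for illustration. Each stored block is a small dense tile, which is the kind of operand a Tensor Core MMA instruction consumes.

```python
def csr_to_bcsr(indptr, indices, data, bh, bw):
    """Convert CSR arrays into a simple BCSR layout with dense bh x bw blocks.

    Returns (block_ptr, block_cols, blocks). Each block is a row-major
    list of lists, zero-padded where the matrix has no stored entry.
    """
    n_rows = len(indptr) - 1
    n_block_rows = (n_rows + bh - 1) // bh
    block_ptr, block_cols, blocks = [0], [], []
    for br in range(n_block_rows):
        row_blocks = {}  # block-column index -> dense bh x bw block
        for r in range(br * bh, min((br + 1) * bh, n_rows)):
            for k in range(indptr[r], indptr[r + 1]):
                bc = indices[k] // bw
                blk = row_blocks.setdefault(bc, [[0.0] * bw for _ in range(bh)])
                blk[r - br * bh][indices[k] - bc * bw] = data[k]
        for bc in sorted(row_blocks):
            block_cols.append(bc)
            blocks.append(row_blocks[bc])
        block_ptr.append(len(block_cols))
    return block_ptr, block_cols, blocks


def bcsr_spmm(block_ptr, block_cols, blocks, B, bh, bw, n_rows):
    """Reference SpMM C = A @ B that visits only the stored (non-empty) blocks."""
    n = len(B[0])
    C = [[0.0] * n for _ in range(n_rows)]
    for br in range(len(block_ptr) - 1):
        for k in range(block_ptr[br], block_ptr[br + 1]):
            bc, blk = block_cols[k], blocks[k]
            for i in range(bh):
                r = br * bh + i
                if r >= n_rows:
                    break  # padding rows past the matrix edge
                for j in range(bw):
                    c = bc * bw + j
                    if c >= len(B):
                        break  # padding columns past the matrix edge
                    a = blk[i][j]
                    for t in range(n):
                        C[r][t] += a * B[c][t]
    return C


# Example: a 4x4 matrix with 5 non-zeros, split into 2x2 blocks.
indptr, indices, data = [0, 2, 3, 3, 5], [0, 3, 1, 0, 3], [1.0, 2.0, 3.0, 4.0, 5.0]
block_ptr, block_cols, blocks = csr_to_bcsr(indptr, indices, data, 2, 2)
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
C = bcsr_spmm(block_ptr, block_cols, blocks, identity, 2, 2, 4)
```

Multiplying by the identity returns the dense form of the input, which is a convenient sanity check. Zero padding inside each block is the price of the blocked layout: the fewer and denser the blocks, the less work is wasted on stored zeros.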

Abstract

High-performance sparse matrix-matrix (SpMM) multiplication is paramount for science and industry, as the ever-increasing sizes of data prohibit using dense data structures. Yet, existing hardware, such as Tensor Cores (TC), is ill-suited for SpMM, as it imposes strict constraints on data structures that cannot be met by unstructured sparsity found in many applications. To address this, we introduce (S)parse (Ma)trix Matrix (T)ensor Core-accelerated (SMaT): a novel SpMM library that utilizes TCs for unstructured sparse matrices. Our block-sparse library leverages the low-level CUDA MMA (matrix-matrix-accumulate) API, maximizing the performance offered by modern GPUs. Algorithmic optimizations such as sparse matrix permutation further improve performance by minimizing the number of non-zero blocks. The evaluation on NVIDIA A100 shows that SMaT outperforms SotA libraries (DASP, cuSPARSE, and Magicube) by up to 125x (on average 2.6x). SMaT can be used to accelerate many workloads in scientific computing, large-model training, inference, and others.
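The effect of row permutation on the number of non-zero blocks can be illustrated with a small pure-Python sketch. The heuristic below (sorting rows by the set of block columns they occupy) is a simple stand-in chosen for illustration, not the reordering algorithm used by SMaT. The names `count_blocks` and `sort_by_pattern` and the per-row column-list input are likewise assumptions.

```python
def count_blocks(rows, perm, bh, bw):
    """Count non-empty bh x bw blocks when rows are taken in the order `perm`.

    `rows[r]` is the list of column indices holding non-zeros in row r.
    """
    total = 0
    for start in range(0, len(perm), bh):
        occupied = set()  # block columns touched by this group of bh rows
        for r in perm[start:start + bh]:
            occupied.update(c // bw for c in rows[r])
        total += len(occupied)
    return total


def sort_by_pattern(rows, bw):
    """Stand-in heuristic: order rows by their sorted list of occupied block columns."""
    return sorted(range(len(rows)), key=lambda r: sorted({c // bw for c in rows[r]}))


# Rows alternate between the left and right edge of an 8-column matrix.
rows = [[0, 1], [6, 7], [0], [7]]
original = count_blocks(rows, [0, 1, 2, 3], 2, 2)
perm = sort_by_pattern(rows, 2)
reordered = count_blocks(rows, perm, 2, 2)
```

In this toy input, grouping rows with the same column pattern halves the block count, because each pair of rows then shares a single block instead of spreading over two.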

Paper Structure (34 sections, 2 equations, 10 figures, 1 table, 1 algorithm)

Introduction
Background
Hardware execution model
Execution model
Tensor Cores
Memory model
Sparse matrix representation
Unstructured sparsity
Structured sparsity
Blocked format
Performance model
Single instruction time $T_e$
Number of elementary computations $n_e$
SMaT — (S)parse (Ma)trix Matrix (T)ensor Core-accelerated library
Overview
...and 19 more sections

Figures (10)

Figure 1: A bird's-eye view of the entire SMaT pipeline. SMaT performs SpMM on an input matrix in the CSR format stored in any precision supported by Tensor Cores. It first preprocesses the matrix to maximize the block density, minimize the total number of blocks, and improve the load balance between rows. The preprocessing is done only once, and the matrix is internally stored in the BCSR (Blocked-CSR) format. When the SpMM kernel is launched, an optimized CUDA kernel uses block-level bottom-up 2D parallelism to maximize the utilization of GPU hardware resources. The results, both on SuiteSparse and on synthetic matrices, confirm the universality of SMaT: it significantly outperforms the other solutions in almost every test case, for very sparse and relatively dense matrices and for highly unstructured and very regular matrices alike.
Figure 2: Performance measurements vs. model (Equation \ref{eq:perf}) for various combinations of low-level optimizations: C: warp-cooperative asynchronous loading from global to shared memory using memcpy_async; B: using BCSR pointer array to skip empty-block evaluation in the inner loop; T: using TC MMA API (MMA16816).
Figure 3: Distribution of the block count per row in the BCSR format for the input matrix (original), after row reordering, and after row and column reordering, for the test matrices. For cop20k_A, row reordering reduces the number of BCSR blocks by 2.5x and the standard deviation by 3x. For mip1, while the reduction of the total block count is slightly smaller (1.8x), the standard deviation is reduced by 8.4x, significantly improving the load balance for our 2D parallel schedule. Matrix dc2 is the most adversarial for SMaT: with its extreme sparsity and power-law distribution of nonzeros per row, the runtime cannot utilize Tensor Cores, and the warp-level static schedule generates high load imbalance on SMs.
Figure 4: Reordering effect on the performance of SMaT on 9 representative matrices from SuiteSparse.
Figure 5: Reordering effect on the performance of DASP on 9 representative matrices from SuiteSparse.
...and 5 more figures
