NM-SpMM: Accelerating Matrix Multiplication Using N:M Sparsity with GPGPU

Cong Ma, Du Wu, Zhelang Deng, Jiang Chen, Xiaowen Huang, Jintao Meng, Wenxi Zhu, Bingqiang Wang, Amelie Chi Zhou, Peng Chen, Minwen Deng, Yanjie Wei, Shengzhong Feng, Yi Pan

TL;DR

This paper tackles the high resource demands of deploying dense deep learning models by using N:M sparsity to convert dense GEMM into semi-sparse SpMM. It introduces NM-SpMM, a flexible vector-wise N:M sparsity implementation for GPUs, guided by a systematic top-down performance analysis and built on a hierarchical blocking framework combined with sparsity-aware optimizations. The approach achieves close-to-theoretical peak performance across sparsity levels: it is 2.1× faster than nmSPARSE and 1.4× to 6.3× faster than cuBLAS dense GEMM, and it is released as open source. The work shows how attention to data locality, memory-footprint reduction, and latency-hiding pipelines can unlock the practical benefits of N:M sparsity for neural network inference.

Abstract

Deep learning is effective across a wide range of tasks. However, the dense and over-parameterized nature of these models results in significant resource consumption during deployment. Weight pruning, particularly in the form of N:M sparse matrix multiplication, offers an efficient solution by transforming dense operations into semi-sparse ones. N:M sparsity provides a way to balance performance against model accuracy, but it introduces more complex programming and optimization challenges. To address these issues, we design a systematic top-down performance analysis model for N:M sparsity and propose NM-SpMM, an efficient general N:M sparsity implementation. Guided by this analysis, NM-SpMM employs a hierarchical blocking mechanism as a general optimization to enhance data locality, and introduces memory access optimization and pipeline design as sparsity-aware optimizations, allowing it to achieve close-to-theoretical peak performance across different sparsity levels. Experimental results show that NM-SpMM is 2.1× faster than nmSPARSE (the state of the art for general N:M sparsity) and 1.4× to 6.3× faster than cuBLAS's dense GEMM operations, closely approaching the theoretical maximum speedup that follows from the reduction in computation due to sparsity. NM-SpMM is open source and publicly available at https://github.com/M-H482/NM-SpMM.
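To make the idea of turning a dense GEMM into a semi-sparse SpMM concrete, here is a minimal pure-Python sketch of one plausible reading of vector-wise N:M sparsity. It is not the NM-SpMM GPU kernel. The assumptions are: the weight matrix B is split into windows of M rows by L columns, N of the M length-L row-vectors in each window are kept, and the kept vectors are chosen by largest L1 norm (a hypothetical selection rule). The function names `prune_vectorwise_nm`, `nm_spmm`, `decompress`, and `dense_matmul` are illustrative only.

```python
import random

def prune_vectorwise_nm(B, n, m, vec_len):
    """Compress dense B (K x C) under the assumed vector-wise N:M layout.

    Returns (values, index): values is (K*n/m) x C, index is
    (K*n/m) x (C/vec_len) and holds each kept vector's row offset (0..m-1)
    inside its window.
    """
    K, C = len(B), len(B[0])
    assert K % m == 0 and C % vec_len == 0
    values = [[0.0] * C for _ in range(K // m * n)]
    index = [[0] * (C // vec_len) for _ in range(K // m * n)]
    for w in range(K // m):              # window along the K dimension
        for g in range(C // vec_len):    # column group of width vec_len
            cols = range(g * vec_len, (g + 1) * vec_len)
            norms = [sum(abs(B[w * m + r][c]) for c in cols) for r in range(m)]
            # Assumed rule: keep the n vectors with the largest L1 norm.
            keep = sorted(sorted(range(m), key=lambda r: -norms[r])[:n])
            for j, r in enumerate(keep):
                index[w * n + j][g] = r
                for c in cols:
                    values[w * n + j][c] = B[w * m + r][c]
    return values, index

def nm_spmm(A, values, index, n, m, vec_len):
    """Compute A @ B_pruned from the compressed form, skipping pruned vectors."""
    C = len(values[0])
    out = [[0.0] * C for _ in A]
    for i, a_row in enumerate(A):
        for kc, v_row in enumerate(values):
            w = kc // n                              # window of this compressed row
            for g in range(C // vec_len):
                a = a_row[w * m + index[kc][g]]      # gather from A via the index
                for c in range(g * vec_len, (g + 1) * vec_len):
                    out[i][c] += a * v_row[c]
    return out

def decompress(values, index, n, m, vec_len):
    """Rebuild the pruned K x C matrix with explicit zeros (reference only)."""
    Kc, C = len(values), len(values[0])
    B = [[0.0] * C for _ in range(Kc // n * m)]
    for kc in range(Kc):
        for g in range(C // vec_len):
            r = kc // n * m + index[kc][g]
            for c in range(g * vec_len, (g + 1) * vec_len):
                B[r][c] = values[kc][c]
    return B

def dense_matmul(A, B):
    """Plain triple-loop GEMM used as the reference result."""
    return [[sum(a_row[k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for a_row in A]

random.seed(0)
N_KEEP, M_WIN, VEC_LEN = 2, 4, 4    # same N, M, L as the Figure 1 example
A = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(3)]
B = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(8)]
values, index = prune_vectorwise_nm(B, N_KEEP, M_WIN, VEC_LEN)
sparse_out = nm_spmm(A, values, index, N_KEEP, M_WIN, VEC_LEN)
dense_out = dense_matmul(A, decompress(values, index, N_KEEP, M_WIN, VEC_LEN))
max_err = max(abs(s - d) for rs, rd in zip(sparse_out, dense_out)
              for s, d in zip(rs, rd))
```

With N = 2 and M = 4, the compressed matrix has half as many rows as B, so the inner loops do half the multiply-adds of the dense product. Under this layout the compute reduction is N/M, which is why the ideal speedup over dense GEMM is bounded by M/N.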

Paper Structure

This paper contains 20 sections, 6 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: An example of how vector-wise N:M sparsity works, where $N=2$, $M=4$, and the vector length $L=4$.
  • Figure 2: Overview of the Workflow and Internals of NM-SpMM.
  • Figure 3: A hierarchical pipeline of the NM-SpMM GPU implementation.
  • Figure 4: Offline pre-processing of the index matrix in high sparsity scenarios: obtaining $col\_info$, rearranging indices, and changing the data layout. During computation, online packing of $A_s$ reduces memory footprint and enhances arithmetic intensity.
  • Figure 5: Pipeline of NM-SpMM for the moderate-sparsity scenario: computation instructions are used to mask the latency of load instructions from global memory to shared memory.
  • ...and 5 more figures