Masked Matrix Multiplication for Emergent Sparsity

Brian Wheatman, Meghana Madhyastha, Randal Burns

TL;DR

MMM addresses the inefficiency of dense GEMMs on transformer-like workloads by exploiting emergent, dual-sided sparsity through runtime masks $M(X)$ and $N(Y)$ and a pattern-driven, code-generated kernel table. The approach uses preprocessing to encode sparsity patterns in $B$ and dynamic code selection to skip zero blocks in $A$, maintaining vectorization with AVX2/AVX-512 and single-pass parallelism. Empirical results show up to 2x speedups and up to 4x fewer instructions over MKL across a wide sparsity range (60–95% zeros), with additional gains on mid-sized matrices and multi-core servers; performance depends on matrix size, sparsity distribution, and architecture. The work demonstrates a practical path to reduce cost, power, and time for sparsity-enabled AI workloads on CPUs and outlines clear directions for GPU extension and broader optimization.

Abstract

Artificial intelligence workloads, especially transformer models, exhibit emergent sparsity in which computations perform selective sparse access to dense data. The workloads are inefficient on hardware designed for dense computations and do not map well onto sparse data representations. We build a vectorized and parallel matrix-multiplication system $A \times B = C$ that eliminates unnecessary computations and avoids branches based on a runtime evaluation of sparsity. We use a combination of dynamic code lookup to adapt to the specific sparsity encoded in the B matrix and preprocessing of sparsity maps of the A and B matrices to compute conditional branches once for the whole computation. For a wide range of sparsity, from 60% to 95% zeros, our implementation performs fewer instructions and increases performance when compared with Intel MKL's dense or sparse matrix multiply routines. Benefits can be as large as 2 times speedup and 4 times fewer instructions.
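
The idea of precomputing sparsity maps and then skipping whole zero blocks can be illustrated with a minimal pure-Python sketch. It is not the authors' kernel: the function names (`block_mask`, `masked_matmul`, `dense_matmul`), the block width, and the `work` counter are all assumptions made for illustration, and the sketch uses ordinary loops and branches rather than the vectorized, code-generated kernels the abstract describes. It only shows that per-block masks, computed once up front, let the multiplication skip blocks that contain no non-zeros while still producing the dense result.

```python
def block_mask(M, block):
    """One flag per aligned block of `block` columns in each row:
    True if that block holds at least one non-zero value."""
    return [[any(x != 0 for x in row[j:j + block])
             for j in range(0, len(row), block)] for row in M]


def dense_matmul(A, B):
    """Reference triple loop, used only to check the masked version."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for kk in range(k):
            for j in range(m):
                C[i][j] += A[i][kk] * B[kk][j]
    return C


def masked_matmul(A, B, block=2):
    """Multiply A (n x k) by B (k x m), skipping all-zero blocks.
    Masks are computed once, before the main loops (hypothetical layout)."""
    n, k, m = len(A), len(B), len(B[0])
    a_mask = block_mask(A, block)  # blocks along the k dimension of A
    b_mask = block_mask(B, block)  # blocks along the m dimension of B
    C = [[0.0] * m for _ in range(n)]
    work = 0  # multiply-adds actually performed
    for i in range(n):
        for kb, a_live in enumerate(a_mask[i]):
            if not a_live:
                continue  # whole block of A's row is zero
            for kk in range(kb * block, min((kb + 1) * block, k)):
                a = A[i][kk]
                for jb, b_live in enumerate(b_mask[kk]):
                    if not b_live:
                        continue  # whole block of B's row is zero
                    # Within a live block, every element is processed,
                    # zeros included, as a vector instruction would.
                    for j in range(jb * block, min((jb + 1) * block, m)):
                        C[i][j] += a * B[kk][j]
                        work += 1
    return C, work


A = [[1, 2, 0, 0],
     [0, 0, 0, 0]]
B = [[1, 0, 0, 0],
     [0, 0, 0, 3],
     [5, 5, 5, 5],
     [0, 0, 0, 0]]
C, work = masked_matmul(A, B, block=2)
print(C, work)
```

On this small input the masked version performs 4 multiply-adds where the dense reference performs 32, and both produce the same product. The savings depend entirely on how well the zeros align with block boundaries, which is why the choice of block width matters in practice.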

Paper Structure (10 sections, 1 equation, 15 figures, 4 tables)

Figures (15)

  • Figure 1: Runtime for a 2048x2048 matrix multiplication with varying sparsity. MMM outperforms Intel MKL's best algorithm for intermediate levels of sparsity, providing a 2x speedup at 80% zeros.
  • Figure 2: Number of instructions to perform a 2048x2048 matrix multiplication with varying sparsity. MMM reduces instructions by a factor of 4 at 70% zeros.
  • Figure 3: Simple dense matrix multiplication.
  • Figure 4: Simple recursive matrix multiplication implementation. UL, UR, LL, and LR stand for the upper left, upper right, lower left and lower right sub-matrices.
  • Figure 5: Types of sparsity. In random sparsity, any bit may be non-zero. Block-random sparsity assumes aligned, contiguous regions of each row are sparse or not sparse. The degree of block-random sparsity may be lower than the overall fraction of zeros when the zeros are not fully aligned with block boundaries. We also show random column and random block column sparsity. Block column patterns arise in Deja Vu (Liu et al., 2023).
  • ...and 10 more figures