MAGNUS: Generating Data Locality to Accelerate Sparse Matrix-Matrix Multiplication on CPUs

Jordi Wolfson-Pou; Jan Laukemann; Fabrizio Petrini

MAGNUS: Generating Data Locality to Accelerate Sparse Matrix-Matrix Multiplication on CPUs

Jordi Wolfson-Pou, Jan Laukemann, Fabrizio Petrini

TL;DR

This work introduces MAGNUS, a novel algorithm for sparse general matrix-matrix multiplication that maximizes data locality by reorganizing the intermediate product into cache-friendly chunks using a two-level (coarse- and fine-level) locality framework. It is accumulator-agnostic and supports a hybrid strategy that selects between an $AVX$-512 vectorized bitonic sort for small chunks and dense accumulation for larger ones, guided by system characteristics and matrix properties. Experiments show MAGNUS consistently outperforms multiple state-of-the-art baselines on SuiteSparse, RMat16, and large uniform random matrices, often by an order of magnitude, and scales to massive problems where baselines fail, approaching an ideal memory-bandwidth bound. The approach advances practical SpGEMM performance by enabling near-streaming execution through systematic cache-conscious dataflow and adaptive chunking across architectures.

Abstract

Sparse general matrix-matrix multiplication (SpGEMM) is a critical operation in many applications. Current multithreaded implementations are based on Gustavson's algorithm and often perform poorly on large matrices due to limited cache reuse by the accumulators. We present MAGNUS (Matrix Algebra for Gigantic NUmerical Systems), a novel algorithm to maximize data locality in SpGEMM. To generate locality, MAGNUS reorders the intermediate product into discrete cache-friendly chunks using a two-level hierarchical approach. The accumulator is applied to each chunk, where the chunk size is chosen such that the accumulator is cache-efficient. MAGNUS is input- and system-aware: based on the matrix characteristics and target system specifications, the optimal number of chunks is computed by minimizing the storage cost of the necessary data structures. MAGNUS allows for a hybrid accumulation strategy in which each chunk uses a different accumulator based on an input threshold. We consider two accumulators: an AVX-512 vectorized bitonic sorting algorithm and classical dense accumulation. An OpenMP implementation of MAGNUS is compared with several baselines, including Intel MKL, for a variety of different matrices on three Intel architectures. For matrices from the SuiteSparse collection, MAGNUS is faster than all the baselines in most cases and is often an order of magnitude faster than at least one baseline. For massive random matrices, MAGNUS scales to the largest matrix sizes, while the baselines do not. Furthermore, MAGNUS is close to the optimal bound for these matrices, regardless of the matrix size, structure, and density.

MAGNUS: Generating Data Locality to Accelerate Sparse Matrix-Matrix Multiplication on CPUs

TL;DR

-512 vectorized bitonic sort for small chunks and dense accumulation for larger ones, guided by system characteristics and matrix properties. Experiments show MAGNUS consistently outperforms multiple state-of-the-art baselines on SuiteSparse, RMat16, and large uniform random matrices, often by an order of magnitude, and scales to massive problems where baselines fail, approaching an ideal memory-bandwidth bound. The approach advances practical SpGEMM performance by enabling near-streaming execution through systematic cache-conscious dataflow and adaptive chunking across architectures.

Abstract

Paper Structure (17 sections, 9 equations, 9 figures, 3 tables, 4 algorithms)

This paper contains 17 sections, 9 equations, 9 figures, 3 tables, 4 algorithms.

Introduction
Background
Gustavson's Method
Related Work
MAGNUS
Overview
The Fine-level Algorithm
The Coarse-level Algorithm
Accumulation
Choosing the Number of Chunks
Experimental Results
Microbenchmarks
Accumulators
Building Blocks of MAGNUS
Test Configuration for SpGEMM
...and 2 more sections

Figures (9)

Figure 1: Example workflow of MAGNUS, where two threads multiply two $8\times8$ matrices. The computation of rows 1 and 2 assigned to thread $t_0$ is shown. Two coarse- and fine-level chunks are used for each row.
Figure 2: Data structure-view of applying the fine-level algorithm to the first chunk from \ref{['fig:MAGNUS_example']}.
Figure 3: Data structure-view of the coarse-level algorithm for the example in \ref{['fig:MAGNUS_example']}.
Figure 4: Comparison of accumulators used in MAGNUS: AVX-512 vectorized sorting and dense accumulation. A single core of the SPR system is used (see \ref{['tab:machines']}). The rate (millions of elements per second) versus the number of accumulated elements is shown. The dashed lines indicate where the sorting achieves peak performance (32 elements) and where the performance of dense accumulation overtakes that of sorting (256 elements).
Figure 5: Wall-clock time (top) and L2-to-L3 cache evictions (bottom) versus the number of chunks for a set of microbenchmarks that test the performance of the building blocks of MAGNUS. A single core of the SPR system is used (see \ref{['tab:machines']}). Total denotes the sum of the building block times, and Stream is a standard streaming benchmark that serves as the peak-performance baseline. The middle vertical dashed line denotes the optimal number of fine-level chunks. The left and right dashed lines denote the point at which the storage requirement of the reorder and dense accumulation arrays exceed the L2 cache size, respectively.
...and 4 more figures

MAGNUS: Generating Data Locality to Accelerate Sparse Matrix-Matrix Multiplication on CPUs

TL;DR

Abstract

MAGNUS: Generating Data Locality to Accelerate Sparse Matrix-Matrix Multiplication on CPUs

Authors

TL;DR

Abstract

Table of Contents

Figures (9)