Improving Locality in Sparse and Dense Matrix Multiplications

Mohammad Mahdi Salehi Dezfuli; Kazem Cheshmi

TL;DR

Tile fusion improves locality in the computation $D = A (B C)$, where $A$ is sparse, by fusing tiles across the two matrix multiplications. It introduces a sparsity-aware scheduler with a two-step process (coarse tile fusion, then cache-aware splitting) and a fused code path that preserves the memory-hierarchy benefits of GeMM and SpMM, implemented with OpenMP parallelism and vectorization. Across 233 SuiteSparse matrices on multi-core CPUs, tile fusion achieves substantial speedups over unfused baselines and prior fused approaches, with improvements near 2$\times$ in many cases and strong scalability to dozens of cores. The approach relies on a data-movement cost model to bound tile sizes so that they fit in fast memory, and on a two-wavefront schedule to minimize synchronization while maintaining load balance, making it practical for graph neural networks and sparse solvers that reuse intermediates.
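The coarse fusion step and the two-wavefront schedule can be illustrated with a small sketch. This is one plausible reading of the summary above, not the authors' scheduler: it assumes $A$ is square and stored in CSR form, that GeMM iteration $j$ produces row $j$ of $T = B C$, and that an SpMM row can be fused into a tile only when every row of $T$ it reads is produced inside that same tile. The names `coarse_tile_schedule` and `fused_ratio` are hypothetical, and the cache-aware splitting step is not modeled.

```python
def coarse_tile_schedule(indptr, indices, tile_size):
    """Split the rows of a square CSR matrix A into two wavefronts.

    Wavefront 0: one tile per contiguous block of GeMM rows [lo, hi), plus
    every SpMM row i in that block whose nonzero columns all lie in [lo, hi).
    Wavefront 1: the remaining SpMM rows, which read rows of T = B C that
    other tiles produce and therefore must wait for a barrier.
    (Assumed fusion rule, for illustration only.)
    """
    n = len(indptr) - 1
    fused_tiles, leftover = [], []
    for lo in range(0, n, tile_size):
        hi = min(lo + tile_size, n)
        fused = []
        for i in range(lo, hi):
            cols = indices[indptr[i]:indptr[i + 1]]
            if all(lo <= j < hi for j in cols):
                fused.append(i)      # all dependencies are local to the tile
            else:
                leftover.append(i)   # deferred to the second wavefront
        fused_tiles.append({"gemm_rows": (lo, hi), "spmm_rows": fused})
    return fused_tiles, leftover


def fused_ratio(indptr, fused_tiles):
    """Fraction of A's nonzeros that belong to SpMM rows fused in wavefront 0."""
    total = indptr[-1]
    fused = sum(indptr[i + 1] - indptr[i]
                for tile in fused_tiles for i in tile["spmm_rows"])
    return fused / total if total else 1.0


# 4x4 sparsity pattern: row 2 reads column 0, which lies outside its tile [2, 4).
indptr = [0, 2, 3, 5, 7]
indices = [0, 1, 1, 0, 3, 2, 3]
tiles, second_wavefront = coarse_tile_schedule(indptr, indices, tile_size=2)
print(tiles)
print(second_wavefront)
print(fused_ratio(indptr, tiles))
```

In this toy pattern, rows 0, 1, and 3 are fused with their tiles and row 2 is deferred, so a single barrier between the two wavefronts is enough. Larger tiles tend to capture more dependencies but need more fast memory, which is the trade-off a data-movement cost model would have to bound.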

Abstract

Consecutive matrix multiplications are commonly used in graph neural networks and sparse linear solvers. These operations frequently access the same matrices for both reading and writing. While reusing these matrices improves data locality, it presents a challenge due to the irregular dependencies between iterations across the two multiplication operations. Existing fusion methods often introduce excessive synchronization overhead or overlapped computations with limited benefits. This paper proposes tile fusion, a runtime approach that fuses tiles of the two matrix-matrix multiplications, where at least one of the involved matrices is sparse. Tile fusion aims to improve data locality while providing sufficient workload for cores in shared-memory multi-core processors. For a pair of matrix-matrix multiplications, tile fusion outperforms the unfused baseline and MKL implementations with geometric mean speedups of 1.97$\times$ and 1.64$\times$, respectively, on multi-core CPUs.
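To make the irregular dependencies concrete, the following is a minimal unfused sketch of $D = A (B C)$ in plain Python, assuming $A$ is stored in CSR form (`indptr`, `indices`, `data`) and $B$, $C$ are dense lists of rows. The function names and the example matrices are illustrative and do not come from the paper. The GeMM finishes the whole intermediate $T = B C$ before the SpMM starts, and row $i$ of $D$ reads exactly the rows of $T$ named by the column indices of row $i$ of $A$.

```python
def gemm(B, C):
    """Dense T = B @ C; each outer iteration produces one row of T."""
    k, p = len(C), len(C[0])
    return [[sum(row[l] * C[l][c] for l in range(k)) for c in range(p)]
            for row in B]


def spmm(indptr, indices, data, T):
    """Sparse-dense D = A @ T with A in CSR form."""
    p = len(T[0])
    D = []
    for i in range(len(indptr) - 1):
        out = [0.0] * p
        for nz in range(indptr[i], indptr[i + 1]):
            j, a = indices[nz], data[nz]  # row i of D reads row j of T
            for c in range(p):
                out[c] += a * T[j][c]
        D.append(out)
    return D


def row_dependencies(indptr, indices):
    """Map each SpMM row i to the GeMM rows (rows of T) it must wait for."""
    return {i: indices[indptr[i]:indptr[i + 1]]
            for i in range(len(indptr) - 1)}


# A = [[1, 0, 2], [0, 3, 0], [4, 0, 0]] in CSR form.
indptr, indices, data = [0, 2, 3, 4], [0, 2, 1, 0], [1.0, 2.0, 3.0, 4.0]
B = [[1, 0], [0, 1], [1, 1]]
C = [[1, 2], [3, 4]]

T = gemm(B, C)                       # unfused: all of T is written first ...
D = spmm(indptr, indices, data, T)   # ... and then read back row by row
print(D)
print(row_dependencies(indptr, indices))
```

Here row 0 of $D$ needs rows 0 and 2 of $T$, while row 1 needs only row 1. Which rows are needed depends on the sparsity pattern of $A$, so it is only known at runtime, which is consistent with the abstract's description of tile fusion as a runtime approach.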

Paper Structure (19 sections, 3 equations, 12 figures, 3 tables, 1 algorithm)

Introduction
Motivating Example
Tile Fusion
Scheduler
Step 1
Step 2
Fused Code
Experimental Results
Setup
Environment
Matrix Dataset
Fused and Unfused Implementations
GEMM-SpMM Evaluation
Performance in FLOPs
Ablation Study
...and 4 more sections

Figures (12)

Figure 1: The ratio of computations in coarse fused tiles for all matrices from SuiteSparse for the GeMM-SpMM operation.
Figure 2: Three different iteration fusion schedules (Figure \ref{fig:motivation}d--f) for the GeMM-SpMM in Figure \ref{fig:motivation}b and the matrix in Figure \ref{fig:motivation}a. Figure \ref{fig:motivation}c shows the dependence DAG between iterations of the outermost loop of GeMM and SpMM, where colored and white vertices correspond to GeMM and SpMM iterations, respectively. Dark solid lines show synchronization barriers, the dotted red line shows a potential race condition, and vertical dashed lines show the per-thread workload.
Figure 3: Tile fusion schedule after step 1
Figure 4: Variation of fused ratio versus tile size.
Figure 5: GeMM-SpMM performance for all matrices on CascadeLake (top) and EPYC (bottom).
...and 7 more figures
