A sparsity-aware distributed-memory algorithm for sparse-sparse matrix multiplication
Yuxi Hong, Aydin Buluc
TL;DR
This work introduces a sparsity-aware 1D SpGEMM algorithm for distributed-memory systems that fetches only the blocks of the input matrix A needed for local computation, using MPI RDMA and a block-fetch strategy. By preserving the original sparsity structure and optionally applying graph partitioning, it communicates significantly less than sparsity-oblivious 2D/3D approaches and scales well on real-world sparse matrices. The method is implemented in CombBLAS with MPI+OpenMP and shows substantial performance advantages in squaring, Galerkin-like restriction operations, and betweenness centrality workloads, particularly when the partitioning is well chosen. The paper also gives practical guidance on when to apply graph partitioning versus random permutation, and it discusses integration with existing solvers such as PETSc and Trilinos, positioning the approach as a high-performance primitive for SpGEMM-related applications.
Abstract
Multiplying two sparse matrices (SpGEMM) is a common computational primitive used in many areas including graph algorithms, bioinformatics, algebraic multigrid solvers, and randomized sketching. Distributed-memory parallel algorithms for SpGEMM have mainly focused on sparsity-oblivious approaches that use 2D and 3D partitioning. Sparsity-aware 1D algorithms can theoretically reduce communication by not fetching nonzeros of the sparse matrices that do not participate in the multiplication. Here, we present a distributed-memory 1D SpGEMM algorithm and implementation. It uses MPI RDMA operations to mitigate the cost of packing/unpacking submatrices for communication, and it uses a block fetching strategy to avoid excessive fine-grained messaging. Our results show that our 1D implementation outperforms state-of-the-art 2D and 3D implementations within CombBLAS for many configurations, inputs, and use cases, while remaining conceptually simpler.
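The sparsity-aware idea can be illustrated with a small, single-process Python sketch. It is not the CombBLAS implementation: it assumes a 1D column-wise distribution in which a process owns a slice of B's columns and computes the matching columns of C, and it replaces MPI RDMA with a plain dictionary lookup. The names `needed_columns`, `needed_blocks`, and `local_spgemm`, and the dict-of-columns storage, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Assumed storage for this sketch: a sparse matrix is a dict
# {column index: {row index: value}}.

def needed_columns(B_local):
    """Columns of A that participate in A @ B_local.

    Column k of A is needed only if some local column of B has a
    nonzero in row k; every other column of A can stay remote.
    """
    return sorted({k for col in B_local.values() for k in col})

def needed_blocks(cols, block_size):
    """Coalesce needed columns into fixed-size column blocks, so that
    one request covers a whole block instead of a single column."""
    return sorted({c // block_size for c in cols})

def local_spgemm(A_cols, B_local):
    """Column-by-column product: C[:, j] = sum_k A[:, k] * B[k, j]."""
    C = {}
    for j, b_col in B_local.items():
        acc = defaultdict(float)
        for k, b in b_col.items():
            for i, a in A_cols[k].items():
                acc[i] += a * b
        C[j] = dict(acc)
    return C

# A is 6 x 6; in a real run its columns would live on other processes.
A_cols = {
    0: {0: 1.0},
    1: {1: 2.0, 3: 1.0},
    2: {2: 3.0},
    3: {0: 4.0},
    4: {4: 5.0},
    5: {1: 1.0, 5: 6.0},
}
# The two columns of B owned by this (hypothetical) process.
B_local = {0: {1: 1.0, 3: 2.0}, 1: {2: 1.0, 3: 1.0}}

needed = needed_columns(B_local)               # columns 1, 2, 3 of A
blocks = needed_blocks(needed, block_size=4)   # all fall in block 0
# Stand-in for the remote fetch: copy only the needed columns.
fetched = {k: A_cols[k] for k in needed}
C = local_spgemm(fetched, B_local)
```

In this toy input, columns 0, 4, and 5 of A are never touched, and with a block size of 4 the three needed columns fall into a single block. That is the trade-off the block fetching strategy makes: fewer, larger transfers at the cost of possibly moving some columns that are not used (here, column 0).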
