Distributed-Memory Parallel Algorithms for Sparse Matrix and Sparse Tall-and-Skinny Matrix Multiplication
Isuru Ranawaka, Md Taufique Hussain, Charles Block, Gerasimos Gerogiannis, Josep Torrellas, Ariful Azad
TL;DR
Existing distributed SpGEMM algorithms are inefficient when one operand is tall and skinny. This work introduces TS-SpGEMM, a Gustavson-inspired, tile-based algorithm with 1D partitioning that reduces both memory usage and communication. It uses local and remote compute modes with a tile-mode selection mechanism, together with sparsity-aware tiling and an adaptive accumulator choice (SPA or hash), to outperform traditional Sparse SUMMA variants by about 5x on average and to scale to 512 nodes on Perlmutter. Two graph-oriented applications, multi-source BFS and sparse embedding (Force2Vec), are demonstrated, with BFS showing up to 10x improvement over SUMMA-based baselines. The approach requires two copies of $\mathbf{A}$ and inherits some load-balancing challenges from 1D partitioning, but it delivers practical, scalable performance improvements for TS-SpGEMM-driven graph analytics and AMG setup phases, with potential extensions to SpMM and fused-mmm routines.
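The two accumulator types named above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dictionary-based sparse format, and the parameter `d` (the number of columns of the tall-and-skinny matrix) are assumptions made for illustration. Both functions compute one output row in Gustavson (row-wise) fashion and differ only in how partial products are accumulated.

```python
from collections import defaultdict

# Assumed toy format: a sparse row is a dict {column: value},
# and a sparse matrix is a dict {row: sparse row}.

def row_product_hash(a_row, B):
    """Accumulate one output row in a hash table (memory ~ nonzeros produced)."""
    acc = defaultdict(float)
    for k, a_val in a_row.items():            # nonzeros A[i, k]
        for j, b_val in B.get(k, {}).items():  # nonzeros B[k, j]
            acc[j] += a_val * b_val
    return dict(acc)

def row_product_spa(a_row, B, d):
    """Accumulate one output row in a dense sparse accumulator (SPA) of width d."""
    values = [0.0] * d       # dense value array, one slot per column of B
    occupied = [False] * d   # marks which slots have been written
    nz_cols = []             # columns touched, so the result can be gathered
    for k, a_val in a_row.items():
        for j, b_val in B.get(k, {}).items():
            if not occupied[j]:
                occupied[j] = True
                nz_cols.append(j)
            values[j] += a_val * b_val
    return {j: values[j] for j in nz_cols}

# Usage: both accumulators give the same row of C = A * B.
a_row = {0: 2.0, 2: 1.0}
B = {0: {1: 3.0}, 2: {0: 5.0, 1: 4.0}}
print(row_product_hash(a_row, B))
print(row_product_spa(a_row, B, d=2))
```

The dense SPA costs memory proportional to `d` per row regardless of sparsity, while the hash table costs memory proportional to the nonzeros actually produced. How TS-SpGEMM chooses between the two is not specified in this summary, so no selection rule is shown.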
Abstract
We consider a sparse matrix-matrix multiplication (SpGEMM) setting where one matrix is square and the other is tall and skinny. This special variant, called TS-SpGEMM, has important applications in multi-source breadth-first search, influence maximization, sparse graph embedding, and algebraic multigrid solvers. Unfortunately, popular distributed algorithms like Sparse SUMMA deliver suboptimal performance for TS-SpGEMM. To address this limitation, we develop a novel distributed-memory algorithm tailored for TS-SpGEMM. Our approach employs customized 1D partitioning for all matrices involved and leverages sparsity-aware tiling for efficient data transfers. In addition, it minimizes communication overhead by incorporating both local and remote computations. On average, our TS-SpGEMM algorithm attains 5x performance gains over 2D and 3D SUMMA. Furthermore, we use our algorithm to implement multi-source breadth-first search and sparse graph embedding algorithms and demonstrate their scalability up to 512 nodes (or 65,536 cores) on NERSC Perlmutter.
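To make the 1D setting concrete, the following sketch simulates row-wise (Gustavson-style) multiplication under a contiguous 1D row partition in a single process. It is an illustration under stated assumptions, not the paper's algorithm: the helper names `partition_rows` and `local_multiply` are hypothetical, the matrices are plain dictionaries, and tiling, mode selection, and real message passing are not modelled. The sketch only reports which rows of the tall-and-skinny matrix a rank would need from other ranks.

```python
def partition_rows(n, p):
    """Split rows 0..n-1 into p contiguous blocks; return the p+1 block bounds."""
    base, extra = divmod(n, p)
    bounds = [0]
    for r in range(p):
        bounds.append(bounds[-1] + base + (1 if r < extra else 0))
    return bounds

def local_multiply(rank, A, B, bounds):
    """Compute the rows of C = A * B owned by `rank`.

    A and B are dicts {row: {col: value}}. B is passed in whole only because
    this is a single-process simulation; rows of B outside the rank's block
    are recorded in `remote_rows` as the data that would require communication.
    """
    lo, hi = bounds[rank], bounds[rank + 1]
    C, remote_rows = {}, set()
    for i in range(lo, hi):
        acc = {}
        for k, a_val in A.get(i, {}).items():
            if not (lo <= k < hi):
                remote_rows.add(k)          # B[k, :] lives on another rank
            for j, b_val in B.get(k, {}).items():
                acc[j] = acc.get(j, 0.0) + a_val * b_val
        if acc:
            C[i] = acc
    return C, remote_rows

# Usage: a 4x4 sparse A, a 4x2 tall-and-skinny B, and 2 simulated ranks.
A = {0: {0: 1.0, 3: 2.0}, 1: {1: 1.0}, 2: {2: 1.0}, 3: {0: 1.0}}
B = {0: {0: 1.0}, 1: {1: 1.0}, 2: {0: 2.0}, 3: {1: 3.0}}
bounds = partition_rows(4, 2)
for rank in range(2):
    C_local, remote = local_multiply(rank, A, B, bounds)
    print(rank, C_local, sorted(remote))
```

In this toy model the set of remote rows depends only on the sparsity pattern of the local block of the square matrix, which is the kind of information a sparsity-aware transfer scheme can exploit.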
