Table of Contents
Fetching ...

Distributed-Memory Parallel Algorithms for Sparse Matrix and Sparse Tall-and-Skinny Matrix Multiplication

Isuru Ranawaka, Md Taufique Hussain, Charles Block, Gerasimos Gerogiannis, Josep Torrellas, Ariful Azad

TL;DR

This work addresses the inefficiency of existing distributed SpGEMM algorithms for the tall-and-skinny case by introducing TS-SpGEMM, a Gustavson-inspired, 1-D partitioned, tile-based algorithm that simultaneously reduces memory usage and communication. It introduces local and remote compute modes and a tile-mode selection mechanism, enabling sparsity-aware tiling and adaptive accumulator choice (SPA vs hash) to outperform traditional Sparse SUMMA variants by about 5x on average and scale to 512 nodes on Perlmutter. Two graph-oriented applications, multi-source BFS and sparse embedding (Force2Vec), are demonstrated, achieving substantial speedups and scalability, with BFS showing up to 10x improvement over SUMMA-enabled baselines. While requiring two copies of $\mathbf{A}$ and inheriting some load-balancing challenges from 1-D partitioning, the approach delivers practical, scalable performance improvements for TS-SpGEMM-driven graph analytics and AMG setup phases, with potential extensions to SpMM and fused-mmm routines.

Abstract

We consider a sparse matrix-matrix multiplication (SpGEMM) setting where one matrix is square and the other is tall and skinny. This special variant, called TS-SpGEMM, has important applications in multi-source breadth-first search, influence maximization, sparse graph embedding, and algebraic multigrid solvers. Unfortunately, popular distributed algorithms like sparse SUMMA deliver suboptimal performance for TS-SpGEMM. To address this limitation, we develop a novel distributed-memory algorithm tailored for TS-SpGEMM. Our approach employs customized 1D partitioning for all matrices involved and leverages sparsity-aware tiling for efficient data transfers. In addition, it minimizes communication overhead by incorporating both local and remote computations. On average, our TS-SpGEMM algorithm attains 5x performance gains over 2D and 3D SUMMA. Furthermore, we use our algorithm to implement multi-source breadth-first search and sparse graph embedding algorithms and demonstrate their scalability up to 512 Nodes (or 65,536 cores) on NERSC Perlmutter.

Distributed-Memory Parallel Algorithms for Sparse Matrix and Sparse Tall-and-Skinny Matrix Multiplication

TL;DR

This work addresses the inefficiency of existing distributed SpGEMM algorithms for the tall-and-skinny case by introducing TS-SpGEMM, a Gustavson-inspired, 1-D partitioned, tile-based algorithm that simultaneously reduces memory usage and communication. It introduces local and remote compute modes and a tile-mode selection mechanism, enabling sparsity-aware tiling and adaptive accumulator choice (SPA vs hash) to outperform traditional Sparse SUMMA variants by about 5x on average and scale to 512 nodes on Perlmutter. Two graph-oriented applications, multi-source BFS and sparse embedding (Force2Vec), are demonstrated, achieving substantial speedups and scalability, with BFS showing up to 10x improvement over SUMMA-enabled baselines. While requiring two copies of and inheriting some load-balancing challenges from 1-D partitioning, the approach delivers practical, scalable performance improvements for TS-SpGEMM-driven graph analytics and AMG setup phases, with potential extensions to SpMM and fused-mmm routines.

Abstract

We consider a sparse matrix-matrix multiplication (SpGEMM) setting where one matrix is square and the other is tall and skinny. This special variant, called TS-SpGEMM, has important applications in multi-source breadth-first search, influence maximization, sparse graph embedding, and algebraic multigrid solvers. Unfortunately, popular distributed algorithms like sparse SUMMA deliver suboptimal performance for TS-SpGEMM. To address this limitation, we develop a novel distributed-memory algorithm tailored for TS-SpGEMM. Our approach employs customized 1D partitioning for all matrices involved and leverages sparsity-aware tiling for efficient data transfers. In addition, it minimizes communication overhead by incorporating both local and remote computations. On average, our TS-SpGEMM algorithm attains 5x performance gains over 2D and 3D SUMMA. Furthermore, we use our algorithm to implement multi-source breadth-first search and sparse graph embedding algorithms and demonstrate their scalability up to 512 Nodes (or 65,536 cores) on NERSC Perlmutter.
Paper Structure (23 sections, 1 equation, 13 figures, 5 tables, 3 algorithms)

This paper contains 23 sections, 1 equation, 13 figures, 5 tables, 3 algorithms.

Figures (13)

  • Figure 1: Distributed memory Gustavson's algorithm using a toy example of $10\times10$ square sparse matrix and $10\times5$ tall-and-skinny sparse matrix distributed over $3$ processes in a row partitioned manner (dark lines represent process boundary). Shaded rows of $\mathbf{A}$ and $\mathbf{B}$ represent parts of the input matrices involved in the computation of the shaded row of $\mathbf{C}$. Shaded elements in the $nzc$ vectors represent columns with at least one non-zero (non-zero columns) of the local matrix owned by each process. Thus, the nzc vector of each process determines which rows of $\mathbf{B}$ are accessed by this process. Note that while the sparsity patterns of $\mathbf{A}_1$ and $\mathbf{A}_2$ differ significantly, both require all but one row of $\mathbf{B}$.
  • Figure 2: Distribution of the same matrices as in Fig.\ref{['fig:dist-gustavson']} involved in our algorithm. The highlighted regions in each subfigure represent (a) a $3\times3$ tile in $\mathbf{A}$, (b) the same tile in $\mathbf{A}^c$, (c) the rows of $\mathbf{B}$ used by this tile, and (d) the nonzeros in $\mathbf{C}$ produced by these tiles.
  • Figure 3: $P_2$ decides the mode of two tiles shown in red and blue within $\mathbf{A}$. To compute $\mathbf{C}_1$, the red tile is multiplied with the entire $\mathbf{B}_2$. Since $\textit{nnz}(\mathbf{B}_2)$ is greater than the affected nonzeros in $\mathbf{C}_1$, the red tile is marked as a remote tile within $\mathbf{A}_1$. To compute $\mathbf{C}_3$, the blue tile is multiplied with just one nonzero in $\mathbf{B}_2$ shown in blue. In this case, it is beneficial to communicate necessary data from $\mathbf{B}$ and hence, the blue tile is marked as a local tile within $\mathbf{A}_3$.
  • Figure 4: (a) An illustration of the force-directed node embedding, where neighboring vertices generate attractive forces and non-neighboring vertices generate repulsive forces. (b) The matrix $\mathbf{A}$ represents the adjacency matrix of the graph and $\mathbf{Z}^{\hbox{\scriptsize \sf T}}$ represents the sparse embedding matrix. Then, force computations are mapped to a TS-SpGEMM operation. (c) The minibatch computation, where we set tile height to be equal to the batch size.
  • Figure 5: The impact of an increasing tile width on memory and runtime on $8$ nodes ($64$ processes). The x-axis shows the width of the tile ranging from $n/p$ to $n$, expressed as multiples of $n/p$. The left subfigure shows the increase in memory consumption, while the right subfigure shows the impact on the runtime.
  • ...and 8 more figures