Communication-Avoiding SpGEMM via Trident Partitioning on Hierarchical GPU Interconnects

Julian Bellavita; Lorenzo Pichetti; Thomas Pasquali; Flavio Vella; Giulia Guidi

Abstract

The multiplication of two sparse matrices, known as SpGEMM, is a key kernel in scientific computing and large-scale data analytics, underpinning graph algorithms, machine learning, simulations, and computational biology, where sparsity is often highly unstructured. Unstructured sparsity makes high performance hard to achieve because it limits both memory efficiency and scalability. In distributed memory, the cost of exchanging and merging partial products across nodes further constrains performance. These issues are exacerbated on modern heterogeneous supercomputers with deep, hierarchical GPU interconnects. Current SpGEMM implementations overlook the gap between intra-node and inter-node bandwidth, which results in unnecessary data movement and synchronization and leaves the fast intra-node interconnect underused. To address these challenges, we introduce Trident, a hierarchy-aware 2D distributed SpGEMM algorithm that uses communication-avoiding techniques and asynchronous communication to exploit the hierarchical and heterogeneous architecture of modern supercomputing interconnects. Central to Trident is the novel trident partitioning scheme, which enables hierarchy-aware decomposition and reduces internode communication by leveraging the higher bandwidth between GPUs within a node compared to across nodes. We evaluate Trident on unstructured matrices, achieving up to $2.38\times$ speedup over a 2D SpGEMM with a corresponding geometric mean speedup of $1.54\times$. Trident reduces internode communication volume by up to $2\times$ on NERSC's Perlmutter supercomputer. Furthermore, we demonstrate the effectiveness of Trident in speeding up Markov Clustering, achieving up to $2\times$ speedup compared to competing strategies.

Paper Structure (38 sections, 1 theorem, 2 equations, 11 figures, 2 tables, 2 algorithms)

Introduction
Background and Related Work
Distributed SpGEMM
1D and Sparsity-Aware Distributed SpGEMM
2D Distributed SpGEMM
3D Distributed SpGEMM
Hierarchical Network
Hierarchy-Aware Graph Computation
Hierarchy-Aware Collective Communication
Algorithm
Hierarchical Network Definition
Trident Partitioning
Trident Algorithm
Trident Outer 2D Algorithm
Trident Inner 1D Algorithm
...and 23 more sections

Key Result

Proposition 1

For unstructured matrices with $nnz/P$ nonzeros per tile, each iteration requires processor $P_{ij}$ to fetch tiles $\mathbf{A}_{ir}$ and $\mathbf{B}_{rj}$ from two remote processes, each transferring $nnz/P$ nonzeros over the global interconnect Gi. Then, including the intranode [...]
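
As a rough illustration of the per-iteration figure quoted above, the following sketch tallies the nonzeros one process moves over Gi, assuming nonzeros are spread uniformly across tiles. The function names are hypothetical. The round counts ($\sqrt{P}$ for a plain 2D grid, $\sqrt{P/\lambda}$ for the Trident grid) are assumptions inferred from the grid shapes in Figure 2, not taken from the truncated proposition, and the intranode term is left out.

```python
import math

def per_round_gi_volume(nnz: int, P: int) -> float:
    """Nonzeros one process fetches over Gi in one round:
    two remote tiles (one of A, one of B), each holding nnz/P nonzeros."""
    return 2 * nnz / P

def total_gi_volume(nnz: int, P: int, lam: int = 1) -> float:
    """Assumed total over all rounds on a sqrt(P/lam) x sqrt(P/lam) grid.
    lam=1 models a plain 2D grid; lam>1 models lam GPUs sharing a node."""
    rounds = math.sqrt(P / lam)  # assumption: one round per grid column
    return per_round_gi_volume(nnz, P) * rounds

# Illustrative values only, not taken from the paper's experiments.
nnz, P, lam = 1_000_000, 64, 4
baseline = total_gi_volume(nnz, P)       # 8 rounds on an 8 x 8 grid
trident = total_gi_volume(nnz, P, lam)   # 4 rounds on a 4 x 4 x 4 grid
ratio = baseline / trident               # equals sqrt(lam) under this model
```

Under these assumptions the ratio is $\sqrt{\lambda}$, which for $\lambda = 4$ is in line with the up-to-$2\times$ reduction in internode volume reported in the abstract; the paper's own analysis may differ.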

Figures (11)

Figure 1: A model of a hierarchical interconnect and compute subsystem, where each node contains four GPUs connected internally via Li and across nodes via Gi.
Figure 2: Each process is mapped to a GPU on a node. (a) In Sparse SUMMA, processes are organized in a $\sqrt{P} \times \sqrt{P}$ grid. (b) In Trident, processes are organized using a hybrid 2D $+$ 1D scheme, resulting in a $\sqrt{P/\lambda} \times \sqrt{P/\lambda} \times \lambda$ grid.
Figure 3: An example of internode communication between $N_{ir}$ and $N_{ij}$ for $\lambda = 4$. The 2D tile $\mathbf{A}_{ir}$ is distributed across the $\lambda$ processes $P_{ir:}$. During the internode communication, each process $P_{irk}$ sends its own tile to $P_{ijk}$ through Gi.
Figure 4: $P_{ij3}$ is a process associated with one of the 1D slices within a given 2D tile. $P_{ij3}$ is mapped to one of the local GPUs within the node connected to other local GPUs via Li and to GPUs on other nodes via Gi. In round $r$, $\mathbf{A}_{ir3}$ and $\mathbf{B}_{rj3}$ will come through Gi during node-to-node communication, while the other $\mathbf{B}_{rj:}$ will come through Li during the $\mathsf{Allgatherv}$.
Figure 5: A snapshot of the three-thread model and asynchronous internode communication in a single SpGEMM round, where the main thread of $P_{ijk}$ requests $\mathbf{A}$ tiles using MPI_Put and the communication thread provides them through CUDA-aware MPI.
...and 6 more figures
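
The hybrid 2D + 1D layout from Figure 2 and the sender-receiver pairing from Figure 3 can be sketched as a mapping between flat ranks and grid coordinates. This is a minimal sketch, assuming ranks are numbered so that the $\lambda$ processes of a node are consecutive (the 1D index $k$ varies fastest) and nodes are placed row-major on the 2D grid. The names `trident_coords`, `trident_rank`, and `a_tile_sender` and the rank ordering are hypothetical, not taken from the paper.

```python
import math

def grid_side(P: int, lam: int) -> int:
    """Side length q of the q x q x lam grid, q = sqrt(P / lam)."""
    if P % lam != 0:
        raise ValueError("P must be divisible by lam")
    q = math.isqrt(P // lam)
    if q * q * lam != P:
        raise ValueError("P / lam must be a perfect square")
    return q

def trident_coords(rank: int, P: int, lam: int) -> tuple[int, int, int]:
    """Map a flat rank to (i, j, k): node N_ij, GPU slot k within that node."""
    q = grid_side(P, lam)
    if not 0 <= rank < P:
        raise ValueError("rank out of range")
    node, k = divmod(rank, lam)  # assumed: consecutive ranks share a node
    i, j = divmod(node, q)       # assumed: row-major node placement
    return i, j, k

def trident_rank(i: int, j: int, k: int, P: int, lam: int) -> int:
    """Inverse of trident_coords."""
    q = grid_side(P, lam)
    return (i * q + j) * lam + k

def a_tile_sender(rank: int, r: int, P: int, lam: int) -> int:
    """Rank of P_irk, the process that sends its slice of A_ir to P_ijk
    in round r (same row i, same GPU slot k, node column r)."""
    i, _, k = trident_coords(rank, P, lam)
    return trident_rank(i, r, k, P, lam)

# Example: 16 GPUs, 4 per node -> a 2 x 2 grid of nodes with 4 slots each.
coords = trident_coords(5, 16, 4)    # node 1, slot 1
sender = a_tile_sender(5, 0, 16, 4)  # same row and slot, node column 0
```

Keeping $k$ fixed between sender and receiver means each GPU pulls only its own slice across Gi, leaving the remaining exchange to the intranode link, as the captions for Figures 3 and 4 describe.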

Theorems & Definitions (1)

Proposition 1: Per-process communication volume
