Sparsity-Aware Communication for Distributed Graph Neural Network Training

Ujjaini Mukhopadhyay, Alok Tripathy, Oguz Selvitopi, Katherine Yelick, Aydin Buluc

TL;DR

This work tackles the communication bottleneck in distributed full-batch Graph Neural Network (GNN) training by introducing sparsity-aware SpMM algorithms that communicate only necessary data. It develops 1D and 1.5D sparsity-aware methods and leverages graph partitioning to minimize both total and maximum inter-process communication, with Graph-VB (GVB) outperforming traditional METIS in many scenarios. The approach demonstrates up to 14X speedups on 256 GPUs and, on some graphs, reduces communication to near-zero relative to sparsity-oblivious baselines, validating its practicality for large-scale GNN training. The results suggest that sparsity-aware communication, combined with multi-criteria partitioning, can significantly enhance scalability and may extend to other communication-avoiding schemes and partitions.

Abstract

Graph Neural Networks (GNNs) are a computationally efficient method to learn embeddings and classifications on graph data. However, GNN training has low computational intensity, making communication costs the bottleneck for scalability. Sparse-matrix dense-matrix multiplication (SpMM) is the core computational operation in full-graph training of GNNs. Previous work parallelizing this operation focused on sparsity-oblivious algorithms, where matrix elements are communicated regardless of the sparsity pattern. This leads to a predictable communication pattern that can be overlapped with computation and enables the use of collective communication operations, at the expense of wasting significant bandwidth by communicating unnecessary data. We develop sparsity-aware algorithms that tackle the communication bottlenecks in GNN training with three novel approaches. First, we communicate only the necessary matrix elements. Second, we utilize a graph partitioning model to reorder the matrix and drastically reduce the amount of communicated elements. Finally, we address the high load imbalance in communication with a tailored partitioning model, which minimizes both the total communication volume and the maximum sending volume. We further couple these sparsity-exploiting approaches with a communication-avoiding approach (1.5D parallel SpMM) in which submatrices are replicated to reduce communication. We explore the tradeoffs of these combined optimizations and show up to 14X improvement on 256 GPUs relative to a popular GNN framework based on communication-oblivious SpMM. On some instances, communication is reduced to almost zero, resulting in communication-free parallel training.
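The two partitioning objectives named in the abstract, total communication volume and maximum sending volume, can be illustrated with a small NumPy sketch. It is not the authors' implementation: the helper names `two_cliques`, `send_volumes`, and `oblivious_total` are hypothetical, the matrix is stored densely for readability, and volume is counted in rows of the dense matrix (one unit per row delivered to one destination) rather than in bytes.

```python
import numpy as np

def two_cliques(k):
    """Toy graph (an assumption for this example): two k-cliques joined by one edge."""
    n = 2 * k
    a = np.zeros((n, n), dtype=bool)
    a[:k, :k] = True
    a[k:, k:] = True
    np.fill_diagonal(a, False)
    a[k - 1, k] = a[k, k - 1] = True  # single bridge edge between the cliques
    return a

def send_volumes(adj_t, part, p):
    """Rows of H that each process sends under a 1D row partition.

    adj_t : (n, n) array holding the nonzero pattern of A^T
    part  : length-n integer array, part[v] = process that owns row v
    Returns sent, where sent[q] counts (row, destination) pairs sent by q.
    """
    sent = np.zeros(p, dtype=int)
    for i in range(p):
        rows_i = np.flatnonzero(part == i)
        # Columns with at least one nonzero in process i's block row.
        needed = np.flatnonzero(adj_t[rows_i].any(axis=0))
        # Rows of H that process i already owns are not communicated.
        remote = needed[part[needed] != i]
        # Each remote row is sent once by its owner to process i.
        sent += np.bincount(part[remote], minlength=p)
    return sent

def oblivious_total(n, p):
    """Sparsity-oblivious 1D baseline: every row of H reaches all other processes."""
    return n * (p - 1)

if __name__ == "__main__":
    a = two_cliques(4)
    good = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # one clique per process
    bad = np.array([0, 1, 0, 1, 0, 1, 0, 1])   # cliques split across processes
    for name, part in [("good", good), ("bad", bad)]:
        sent = send_volumes(a, part, 2)
        print(name, "total:", sent.sum(), "max send:", sent.max())
    print("oblivious total:", oblivious_total(8, 2))
```

On this toy graph, the partition that keeps each clique on one process sends far fewer rows than the interleaved one, which does no better than the sparsity-oblivious baseline. That gap is what a partitioner targeting total and maximum send volume tries to widen.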

Paper Structure

This paper contains 31 sections, 10 equations, 7 figures, 3 tables, 2 algorithms.

Figures (7)

  • Figure 1: The partitioning of $\mathbf{A}^{\hbox{\scriptsize \sf T}}$ and $\mathbf{H}$ in the sparsity-aware 1D algorithm with $4$ processes. Boldly shaded entries in the first block row of $\mathbf{A}^{\hbox{\scriptsize \sf T}}$ and in $\mathbf{H}^{l-1}$ indicate the non-empty columns and the respective rows of $\mathbf{H}$ that need to be received by P0.
  • Figure 2: The partitioning of $\mathbf{A}^{\hbox{\scriptsize \sf T}}$ and $\mathbf{H}$ in the sparsity-aware 1.5D algorithm among eight processes with a replication factor of 2 ($c=2$). Boldly shaded entries in the first block row of $\mathbf{A}^{\hbox{\scriptsize \sf T}}$ and in $\mathbf{H}^{l-1}$ indicate the non-empty columns and the respective rows of $\mathbf{H}$, which will be received by P0 and P1.
  • Figure 3: 1D performance results for sparsity-oblivious, sparsity-aware, and sparsity-aware + GVB graph partitioning implementations. Note that these are log-log plots of the number of GPUs versus the time for a single epoch. For Reddit, we use $p=4, 16, 32, 64$. For the Amazon and Protein datasets, we also use $p=128$ and $256$. Missing data points on Amazon and on Protein for $p=4$ mean that those trials ran out of memory.
  • Figure 4: 1D performance breakdown. The x-axis of each plot refers to the number of GPUs used. This breakdown includes local computation, alltoall, and bcast. We compare results against CAGNET [tripathy2020reducing]. SA represents just a sparsity-aware implementation, and SA + GVB refers to our sparsity-aware implementation used in conjunction with GVB graph partitioning. The sparsity-oblivious CAGNET implementation involves the broadcast and local computation, which in this case consists of the local SpMM computations. The sparsity-aware implementations shown in the middle and right bars involve a single all-to-all call and a series of local computations, which include gathering the data to send, allocating space in GPU memory, and the local SpMM computation.
  • Figure 5: 1D performance results for sparsity-oblivious, sparsity-aware, and graph partitioning implementations on the Papers dataset with $p=16$ processes. This breakdown includes local computation, alltoall, and bcast, exactly like Figure 4.
  • ...and 2 more figures
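
The row selection described in the Figure 1 caption can be sketched as a single-machine simulation. This is a hedged illustration, not the paper's code: `sparsity_aware_1d_spmm` is a hypothetical name, the $p$ processes are simulated by a loop, the matrix is dense for readability, and indexing into the global `h` array stands in for the all-to-all exchange that would deliver the remote rows in a real distributed run.

```python
import numpy as np

def sparsity_aware_1d_spmm(adj_t, h, p):
    """Simulate the 1D sparsity-aware product A^T @ H over p block rows.

    Returns the full product and, per simulated process, the number of
    rows of H it would have to receive from other processes.
    """
    n = adj_t.shape[0]
    bounds = np.linspace(0, n, p + 1).astype(int)  # contiguous block-row boundaries
    out = np.zeros((n, h.shape[1]))
    received = []
    for i in range(p):
        lo, hi = bounds[i], bounds[i + 1]
        block = adj_t[lo:hi]                        # process i's block row of A^T
        # Non-empty columns of the block row (the shaded columns in Figure 1).
        needed = np.flatnonzero(np.any(block != 0, axis=0))
        # Only the needed rows outside [lo, hi) must come from other processes.
        remote = needed[(needed < lo) | (needed >= hi)]
        received.append(len(remote))
        # Local SpMM restricted to the needed columns and the matching rows of H.
        out[lo:hi] = block[:, needed] @ h[needed]
    return out, received

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = (rng.random((16, 16)) < 0.15).astype(float)
    h = rng.standard_normal((16, 3))
    z, recv = sparsity_aware_1d_spmm(a, h, 4)
    print("matches dense product:", np.allclose(z, a @ h))
    print("rows received per process:", recv)
```

Skipping the empty columns leaves the product unchanged because those columns multiply rows of $\mathbf{H}$ by zero. When the matrix is block diagonal with respect to the partition, no remote rows are needed at all, which corresponds to the near-zero communication case mentioned in the abstract.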