Table of Contents
Fetching ...

Distributed Matrix-Based Sampling for Graph Neural Network Training

Alok Tripathy, Katherine Yelick, Aydin Buluc

TL;DR

This work tackles the bottleneck of sampling in distributed minibatch training for Graph Neural Networks by introducing a matrix based bulk sampling framework that expresses sampling as sparse matrix multiplications (SpGEMM). It enables sampling of multiple minibatches at once and scales to graphs that do not fit on a single device, through GPU based 1.5D distributed strategies and communication avoiding SpGEMM. The authors provide end to end pipeline, implement GraphSAGE and LADIES within this framework, and demonstrate substantial speedups over Quiver on large Open Graph Benchmark datasets and across multiple GPUs. The approach preserves model accuracy while delivering significant reductions in sampling time and improved scalability for both node wise and layer wise sampling algorithms. This work advances practical, scalable training of GNNs on very large graphs by tightly integrating matrix based sampling with distributed sparse linear algebra techniques.

Abstract

Graph Neural Networks (GNNs) offer a compact and computationally efficient way to learn embeddings and classifications on graph data. GNN models are frequently large, making distributed minibatch training necessary. The primary contribution of this paper is new methods for reducing communication in the sampling step for distributed GNN training. Here, we propose a matrix-based bulk sampling approach that expresses sampling as a sparse matrix multiplication (SpGEMM) and samples multiple minibatches at once. When the input graph topology does not fit on a single device, our method distributes the graph and use communication-avoiding SpGEMM algorithms to scale GNN minibatch sampling, enabling GNN training on much larger graphs than those that can fit into a single device memory. When the input graph topology (but not the embeddings) fits in the memory of one GPU, our approach (1) performs sampling without communication, (2) amortizes the overheads of sampling a minibatch, and (3) can represent multiple sampling algorithms by simply using different matrix constructions. In addition to new methods for sampling, we introduce a pipeline that uses our matrix-based bulk sampling approach to provide end-to-end training results. We provide experimental results on the largest Open Graph Benchmark (OGB) datasets on $128$ GPUs, and show that our pipeline is $2.5\times$ faster than Quiver (a distributed extension to PyTorch-Geometric) on a $3$-layer GraphSAGE network. On datasets outside of OGB, we show a $8.46\times$ speedup on $128$ GPUs in per-epoch time. Finally, we show scaling when the graph is distributed across GPUs and scaling for both node-wise and layer-wise sampling algorithms.

Distributed Matrix-Based Sampling for Graph Neural Network Training

TL;DR

This work tackles the bottleneck of sampling in distributed minibatch training for Graph Neural Networks by introducing a matrix based bulk sampling framework that expresses sampling as sparse matrix multiplications (SpGEMM). It enables sampling of multiple minibatches at once and scales to graphs that do not fit on a single device, through GPU based 1.5D distributed strategies and communication avoiding SpGEMM. The authors provide end to end pipeline, implement GraphSAGE and LADIES within this framework, and demonstrate substantial speedups over Quiver on large Open Graph Benchmark datasets and across multiple GPUs. The approach preserves model accuracy while delivering significant reductions in sampling time and improved scalability for both node wise and layer wise sampling algorithms. This work advances practical, scalable training of GNNs on very large graphs by tightly integrating matrix based sampling with distributed sparse linear algebra techniques.

Abstract

Graph Neural Networks (GNNs) offer a compact and computationally efficient way to learn embeddings and classifications on graph data. GNN models are frequently large, making distributed minibatch training necessary. The primary contribution of this paper is new methods for reducing communication in the sampling step for distributed GNN training. Here, we propose a matrix-based bulk sampling approach that expresses sampling as a sparse matrix multiplication (SpGEMM) and samples multiple minibatches at once. When the input graph topology does not fit on a single device, our method distributes the graph and use communication-avoiding SpGEMM algorithms to scale GNN minibatch sampling, enabling GNN training on much larger graphs than those that can fit into a single device memory. When the input graph topology (but not the embeddings) fits in the memory of one GPU, our approach (1) performs sampling without communication, (2) amortizes the overheads of sampling a minibatch, and (3) can represent multiple sampling algorithms by simply using different matrix constructions. In addition to new methods for sampling, we introduce a pipeline that uses our matrix-based bulk sampling approach to provide end-to-end training results. We provide experimental results on the largest Open Graph Benchmark (OGB) datasets on GPUs, and show that our pipeline is faster than Quiver (a distributed extension to PyTorch-Geometric) on a -layer GraphSAGE network. On datasets outside of OGB, we show a speedup on GPUs in per-epoch time. Finally, we show scaling when the graph is distributed across GPUs and scaling for both node-wise and layer-wise sampling algorithms.
Paper Structure (45 sections, 7 equations, 7 figures, 4 tables, 2 algorithms)

This paper contains 45 sections, 7 equations, 7 figures, 4 tables, 2 algorithms.

Figures (7)

  • Figure 1: Example outputs for both GraphSAGE and LADIES sampling on a batch with vertices $\{1, 5\}$ and a sample number of $s=2$. Bolded vertices denote vertices included in the sample, and dashed edges denote edges included in the sample.
  • Figure 2: Diagram of matrix operations used to sample the first layer of a GNN formed from the example graph in Figure \ref{['fig:examples']} along with the minibatch $\{1, 5\}$.
  • Figure 3: The overall architecture of our distributed pipeline from the perspective of one process. In this diagram, $\mathbf{A}$ is either the entire adjacency matrix or a partition, depending on the algorithm, and $\mathbf{H}$ is a partition of the input feature matrix. The first step in the pipeline is to run sampling on $k$ minibatches at once. Then, for each minibatch, we iteratively call all-to-allv fetch the appropriate feature vectors, and run propagation on minibatch.
  • Figure 4: Performance results for our pipeline using the Graph Replication algorithm with GraphSAGE and compared with Quiver. For each GPU count, we break down the running time into 1) time spent sampling minibatches, 2) time spent in the feature fetching all-to-allv call, and 3) time spent on forward and backward propagation.
  • Figure 5: Performance comparison of Quiver with GPU sampling and Quiver training with Unified Virtual Address (UVA) sampling on Papers and Protein. UVA sampling uses both the CPU and multiple GPUs to run sampling.
  • ...and 2 more figures