Table of Contents
Fetching ...

Armada: Memory-Efficient Distributed Training of Large-Scale Graph Neural Networks

Roger Waleffe, Devesh Sarda, Jason Mohoney, Emmanouil-Vasileios Vlatakis-Gkaragkounis, Theodoros Rekatsinas, Shivaram Venkataraman

TL;DR

Armada tackles the bottleneck of partitioning in distributed GNN training on billion-scale graphs by introducing GREM, a memory-efficient min-edge-cut partitioner that refines prior node assignments during streaming to achieve METIS-like edge cuts with substantially less memory and time. It couples GREM with a disaggregated CPU-GPU architecture that separates neighborhood sampling and feature loading from model computation, enabling independent scaling of each layer and saturating GPUs. Theoretical analysis and extensive experiments show that refinement is crucial for small streaming chunks, and Armada achieves up to 4.5x runtime improvements and 3.1x cost reductions over SoTA systems, with linear GPU scaling and significant reductions in partitioning overhead. Collectively, Armada demonstrates that careful partitioning and architecture design can dramatically reduce the resources required for large-scale GNN training on cloud hardware, making such training more accessible and cost-effective.

Abstract

We study distributed training of Graph Neural Networks (GNNs) on billion-scale graphs that are partitioned across machines. Efficient training in this setting relies on min-edge-cut partitioning algorithms, which minimize cross-machine communication due to GNN neighborhood sampling. Yet, min-edge-cut partitioning over large graphs remains a challenge: State-of-the-art (SoTA) offline methods (e.g., METIS) are effective, but they require orders of magnitude more memory and runtime than GNN training itself, while computationally efficient algorithms (e.g., streaming greedy approaches) suffer from increased edge cuts. Thus, in this work we introduce Armada, a new end-to-end system for distributed GNN training whose key contribution is GREM, a novel min-edge-cut partitioning algorithm that can efficiently scale to large graphs. GREM builds on streaming greedy approaches with one key addition: prior vertex assignments are continuously refined during streaming, rather than frozen after an initial greedy selection. Our theoretical analysis and experimental results show that this refinement is critical to minimizing edge cuts and enables GREM to reach partition quality comparable to METIS but with 8-65x less memory and 8-46x faster. Given a partitioned graph, Armada leverages a new disaggregated architecture for distributed GNN training to further improve efficiency; we find that on common cloud machines, even with zero communication, GNN neighborhood sampling and feature loading bottleneck training. Disaggregation allows Armada to independently allocate resources for these operations and ensure that expensive GPUs remain saturated with computation. We evaluate Armada against SoTA systems for distributed GNN training and find that the disaggregated architecture leads to runtime improvements up to 4.5x and cost reductions up to 3.1x.

Armada: Memory-Efficient Distributed Training of Large-Scale Graph Neural Networks

TL;DR

Armada tackles the bottleneck of partitioning in distributed GNN training on billion-scale graphs by introducing GREM, a memory-efficient min-edge-cut partitioner that refines prior node assignments during streaming to achieve METIS-like edge cuts with substantially less memory and time. It couples GREM with a disaggregated CPU-GPU architecture that separates neighborhood sampling and feature loading from model computation, enabling independent scaling of each layer and saturating GPUs. Theoretical analysis and extensive experiments show that refinement is crucial for small streaming chunks, and Armada achieves up to 4.5x runtime improvements and 3.1x cost reductions over SoTA systems, with linear GPU scaling and significant reductions in partitioning overhead. Collectively, Armada demonstrates that careful partitioning and architecture design can dramatically reduce the resources required for large-scale GNN training on cloud hardware, making such training more accessible and cost-effective.

Abstract

We study distributed training of Graph Neural Networks (GNNs) on billion-scale graphs that are partitioned across machines. Efficient training in this setting relies on min-edge-cut partitioning algorithms, which minimize cross-machine communication due to GNN neighborhood sampling. Yet, min-edge-cut partitioning over large graphs remains a challenge: State-of-the-art (SoTA) offline methods (e.g., METIS) are effective, but they require orders of magnitude more memory and runtime than GNN training itself, while computationally efficient algorithms (e.g., streaming greedy approaches) suffer from increased edge cuts. Thus, in this work we introduce Armada, a new end-to-end system for distributed GNN training whose key contribution is GREM, a novel min-edge-cut partitioning algorithm that can efficiently scale to large graphs. GREM builds on streaming greedy approaches with one key addition: prior vertex assignments are continuously refined during streaming, rather than frozen after an initial greedy selection. Our theoretical analysis and experimental results show that this refinement is critical to minimizing edge cuts and enables GREM to reach partition quality comparable to METIS but with 8-65x less memory and 8-46x faster. Given a partitioned graph, Armada leverages a new disaggregated architecture for distributed GNN training to further improve efficiency; we find that on common cloud machines, even with zero communication, GNN neighborhood sampling and feature loading bottleneck training. Disaggregation allows Armada to independently allocate resources for these operations and ensure that expensive GPUs remain saturated with computation. We evaluate Armada against SoTA systems for distributed GNN training and find that the disaggregated architecture leads to runtime improvements up to 4.5x and cost reductions up to 3.1x.

Paper Structure

This paper contains 15 sections, 1 equation, 8 figures, 1 table, 2 algorithms.

Figures (8)

  • Figure 1: Breakdown of the average runtime per training iteration in the SoTA system MariusGNN (GraphSage-Large on OGBN-Papers100M; details in Section \ref{['sec:eval']}). Neighborhood sampling plus feature loading on the CPU dominates GNN runtime.
  • Figure 2: CPU utilization in the SoTA system Salient++ (details in Section \ref{['sec:eval']}; GraphSage-Small on OGBN-Papers100M). Nearly all CPU resources are used to parallelize mini batch preparation and minimize training time with one GPU; the CPU resources are insufficient for multi-GPU training, leading to sublinear speedups.
  • Figure 3: Armada system diagram. A. Graph data is partitioned using GREM (Section \ref{['sec:partitioning']}) and then B. stored on disk in the storage layer. C. A disaggregated mini batch preparation layer loads graph partitions into memory and prepares mini batches for workers in the compute layer. D. The compute workers process these mini batches on GPUs and periodically synchronize dense model parameters.
  • Figure 4: Percentage of edges cut when using GREM versus METIS on three common graphs. GREM achieves comparable edge cuts to METIS, even with a chunk size of just 10% or 1%.
  • Figure 5: Memory usage and runtime of GREM and METIS when partitioning subgraphs of OGBN-Papers100M of various size. GREM reduces the computational requirements of partitioning.
  • ...and 3 more figures