Table of Contents
Fetching ...

Scaling Large-scale GNN Training to Thousands of Processors on CPU-based Supercomputers

Chen Zhuang, Lingqi Zhang, Du Wu, Peng Chen, Jiajun Huang, Xin Liu, Rio Yokota, Nikoli Dryden, Toshio Endo, Satoshi Matsuoka, Mohamed Wahib

TL;DR

This work tackles memory access inefficiencies and high communication overhead in distributed full-batch GCN training on CPU-based supercomputers. It introduces SuperGCN, a framework built from three pillars: general CPU-friendly aggregation operators, a hierarchical pre-/post-aggregation scheme guided by minimum vertex cover to minimize data exchange, and a communication-aware quantization scheme augmented with masked label propagation to preserve accuracy. The authors demonstrate up to $6\times$ speedup over state-of-the-art CPU baselines and scalable training on thousands of processors across large-scale graphs from the OGBN suite, including datasets with hundreds of millions of nodes and billions of edges, and even outperform peak GPU-based full-batch systems. The approach provides practical viability for CPU-based distributed full-batch GCNs on massive graphs, enabling efficient AI-for-science workloads on existing HPC infrastructure.

Abstract

Graph Convolutional Networks (GCNs), particularly for large-scale graphs, are crucial across numerous domains. However, training distributed full-batch GCNs on large-scale graphs suffers from inefficient memory access patterns and high communication overhead. To address these challenges, we introduce \method{}, an efficient and scalable distributed GCN training framework tailored for CPU-powered supercomputers. Our contributions are threefold: (1) we develop general and efficient aggregation operators designed for irregular memory access, (2) we propose a hierarchical aggregation scheme that reduces communication costs without altering the graph structure, and (3) we present a communication-aware quantization scheme to enhance performance. Experimental results demonstrate that \method{} achieves a speedup of up to 6$\times$ compared with the SoTA implementations, and scales to 1000s of HPC-grade CPUs on the largest publicly available datasets, without sacrificing model convergence and accuracy. Moreover, due to the effective strong scaling of \method{}, we outperform SoTA GPU-based and CPU-based distributed full-batch GCN training frameworks, in absolute performance, for large-scale graphs.

Scaling Large-scale GNN Training to Thousands of Processors on CPU-based Supercomputers

TL;DR

This work tackles memory access inefficiencies and high communication overhead in distributed full-batch GCN training on CPU-based supercomputers. It introduces SuperGCN, a framework built from three pillars: general CPU-friendly aggregation operators, a hierarchical pre-/post-aggregation scheme guided by minimum vertex cover to minimize data exchange, and a communication-aware quantization scheme augmented with masked label propagation to preserve accuracy. The authors demonstrate up to speedup over state-of-the-art CPU baselines and scalable training on thousands of processors across large-scale graphs from the OGBN suite, including datasets with hundreds of millions of nodes and billions of edges, and even outperform peak GPU-based full-batch systems. The approach provides practical viability for CPU-based distributed full-batch GCNs on massive graphs, enabling efficient AI-for-science workloads on existing HPC infrastructure.

Abstract

Graph Convolutional Networks (GCNs), particularly for large-scale graphs, are crucial across numerous domains. However, training distributed full-batch GCNs on large-scale graphs suffers from inefficient memory access patterns and high communication overhead. To address these challenges, we introduce \method{}, an efficient and scalable distributed GCN training framework tailored for CPU-powered supercomputers. Our contributions are threefold: (1) we develop general and efficient aggregation operators designed for irregular memory access, (2) we propose a hierarchical aggregation scheme that reduces communication costs without altering the graph structure, and (3) we present a communication-aware quantization scheme to enhance performance. Experimental results demonstrate that \method{} achieves a speedup of up to 6 compared with the SoTA implementations, and scales to 1000s of HPC-grade CPUs on the largest publicly available datasets, without sacrificing model convergence and accuracy. Moreover, due to the effective strong scaling of \method{}, we outperform SoTA GPU-based and CPU-based distributed full-batch GCN training frameworks, in absolute performance, for large-scale graphs.

Paper Structure

This paper contains 40 sections, 11 equations, 12 figures, 5 tables, 1 algorithm.

Figures (12)

  • Figure 1: Illustration of graph partitioning. Dotted lines indicate cut edges separating distinct subgraphs. Features of boundary nodes (red circles) need to be transferred across these cut edges via remote communication.
  • Figure 2: The flow of our proposed full-batch GCNs training system. Nodes with a red circle are boundary nodes obtained from other workers.
  • Figure 3: Steps of the proposed algorithm for optimizing the neighbor aggregation operator index_add on a single CPU.
  • Figure 4: Strategies to construct a remote graph.
  • Figure 5: Achieving optimal communication volume via finding minimum vertex cover.
  • ...and 7 more figures