Scaling Large-scale GNN Training to Thousands of Processors on CPU-based Supercomputers
Chen Zhuang, Lingqi Zhang, Du Wu, Peng Chen, Jiajun Huang, Xin Liu, Rio Yokota, Nikoli Dryden, Toshio Endo, Satoshi Matsuoka, Mohamed Wahib
TL;DR
This work tackles memory access inefficiencies and high communication overhead in distributed full-batch GCN training on CPU-based supercomputers. It introduces SuperGCN, a framework built from three pillars: general CPU-friendly aggregation operators, a hierarchical pre-/post-aggregation scheme guided by minimum vertex cover to minimize data exchange, and a communication-aware quantization scheme augmented with masked label propagation to preserve accuracy. The authors demonstrate up to $6\times$ speedup over state-of-the-art CPU baselines and scalable training on thousands of processors across large-scale graphs from the OGBN suite, including datasets with hundreds of millions of nodes and billions of edges, and even outperform peak GPU-based full-batch systems. The approach provides practical viability for CPU-based distributed full-batch GCNs on massive graphs, enabling efficient AI-for-science workloads on existing HPC infrastructure.
Abstract
Graph Convolutional Networks (GCNs), particularly for large-scale graphs, are crucial across numerous domains. However, training distributed full-batch GCNs on large-scale graphs suffers from inefficient memory access patterns and high communication overhead. To address these challenges, we introduce \method{}, an efficient and scalable distributed GCN training framework tailored for CPU-powered supercomputers. Our contributions are threefold: (1) we develop general and efficient aggregation operators designed for irregular memory access, (2) we propose a hierarchical aggregation scheme that reduces communication costs without altering the graph structure, and (3) we present a communication-aware quantization scheme to enhance performance. Experimental results demonstrate that \method{} achieves a speedup of up to 6$\times$ compared with the SoTA implementations, and scales to 1000s of HPC-grade CPUs on the largest publicly available datasets, without sacrificing model convergence and accuracy. Moreover, due to the effective strong scaling of \method{}, we outperform SoTA GPU-based and CPU-based distributed full-batch GCN training frameworks, in absolute performance, for large-scale graphs.
