CaPGNN: Optimizing Parallel Graph Neural Network Training with Joint Caching and Resource-Aware Graph Partitioning
Xianfeng Song, Yi Zou, Zheng Shi
TL;DR
CaPGNN targets the communication bottleneck in parallel full-batch GNN training on a single server with multiple GPUs by jointly optimizing cache usage and graph partitioning. The framework introduces JACA to cache halo vertex data across CPU and GPU memory and RAPA to adapt partitions to heterogeneous GPU capabilities, and it overlaps computation and communication in a pipeline. The authors prove convergence under bounded staleness and report speedups of up to 18.98x and communication reductions of up to 99% across multiple datasets, with minimal accuracy loss and, in some cases, improved accuracy. The work also shows the initial feasibility of extending CaPGNN to distributed multi-machine settings, positioning it as a practical option for scalable full-graph GNN training on commodity hardware.
Abstract
Graph-structured data is ubiquitous in the real world, and Graph Neural Networks (GNNs) have become increasingly popular in various fields due to their ability to process such irregular data directly. However, as data scale grows, GNNs become inefficient. Although parallel training offers performance improvements, increased communication costs often offset these advantages. To address this, this paper introduces CaPGNN, a novel parallel full-batch GNN training framework for a single server with multiple GPUs. First, observing that the number of remote vertices in a partition is often greater than or equal to the number of local vertices, and that many of these remote vertices are duplicated across partitions, we propose a joint adaptive caching algorithm that leverages both CPU and GPU memory and integrates lightweight cache update and prefetch techniques to reduce redundant communication. Second, taking into account the varying computational and communication capabilities among GPUs, we propose a communication- and computation-aware heuristic graph partitioning algorithm inspired by graph sparsification. Additionally, we implement a pipeline to overlap computation and communication. Extensive experiments show that CaPGNN improves training efficiency by up to 18.98x and reduces communication costs by up to 99%, with minimal accuracy loss or even accuracy improvement in some cases. Finally, we extend CaPGNN to multi-machine multi-GPU environments. The code is available at https://github.com/songxf1024/CaPGNN.
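The observation that motivates the caching scheme (remote vertices can outnumber local ones, and the same remote vertex may be needed by several partitions) can be illustrated on a toy graph. The sketch below is not taken from the CaPGNN code base: the function name `halo_vertices`, the edge list, and the partition assignment are all hypothetical, chosen only to make the counting concrete.

```python
from collections import Counter

def halo_vertices(edges, part_of):
    """Map each partition to the set of remote (halo) vertices it must fetch.

    Hypothetical helper for illustration; not part of CaPGNN.
    """
    halos = {p: set() for p in set(part_of.values())}
    for u, v in edges:
        pu, pv = part_of[u], part_of[v]
        if pu != pv:
            # An edge that crosses partitions makes each endpoint
            # a halo vertex of the other endpoint's partition.
            halos[pu].add(v)
            halos[pv].add(u)
    return halos

# Toy undirected graph (assumed): 6 vertices, 2 local vertices per partition.
edges = [(0, 1), (0, 2), (0, 4), (1, 2), (2, 3), (2, 5), (3, 4), (4, 5)]
part_of = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}

halos = halo_vertices(edges, part_of)

# Count how many partitions need each remote vertex.
demand = Counter(v for h in halos.values() for v in h)
shared = sorted(v for v, c in demand.items() if c > 1)

total_fetches = sum(len(h) for h in halos.values())  # naive: one fetch per (partition, halo vertex)
unique_fetches = len(demand)                         # each halo vertex served once from a shared cache

for p in sorted(halos):
    print(f"partition {p}: halo = {sorted(halos[p])}")
print("halo vertices needed by more than one partition:", shared)
print("fetches without sharing:", total_fetches, "| distinct vertices:", unique_fetches)
```

In this toy case every partition holds two local vertices but needs two to four remote ones, and three of the six halo vertices are requested by two partitions each. A cache shared across partitions would serve those duplicates once instead of twice, which is the kind of redundancy the abstract describes; how CaPGNN actually selects and places cached vertices is not shown here.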
