CaPGNN: Optimizing Parallel Graph Neural Network Training with Joint Caching and Resource-Aware Graph Partitioning

Xianfeng Song, Yi Zou, Zheng Shi

TL;DR

CaPGNN tackles the communication bottleneck in parallel full-batch GNN training on a single server with multiple GPUs by jointly optimizing cache usage and graph partitioning. The framework introduces JACA to cache halo vertex data across CPU and GPU memory and RAPA to adapt partitions to heterogeneous GPU capabilities, while overlapping computation and communication in a pipeline. The authors prove convergence under bounded staleness and demonstrate speedups of up to 18.98x and communication reductions of up to 99% across multiple datasets, with minimal or even positive effects on accuracy. The work also extends CaPGNN to multi-machine, multi-GPU settings and reports initial feasibility there, offering a practical approach to scalable full-graph GNN training on commodity hardware.

Abstract

Graph-structured data is ubiquitous in the real world, and Graph Neural Networks (GNNs) have become increasingly popular in various fields due to their ability to process such irregular data directly. However, as data scales, GNNs become inefficient. Although parallel training offers performance improvements, increased communication costs often offset these advantages. To address this, this paper introduces CaPGNN, a novel parallel full-batch GNN training framework for a single server with multiple GPUs. First, considering that the number of remote vertices in a partition is often greater than or equal to the number of local vertices and that many of these vertices may be duplicates, we propose a joint adaptive caching algorithm that leverages both CPU and GPU memory, integrating lightweight cache update and prefetch techniques to effectively reduce redundant communication costs. Furthermore, taking into account the varying computational and communication capabilities among GPUs, we propose a communication- and computation-aware heuristic graph partitioning algorithm inspired by graph sparsification. Additionally, we implement a pipeline to overlap computation and communication. Extensive experiments show that CaPGNN improves training efficiency by up to 18.98x and reduces communication costs by up to 99%, with minimal accuracy loss or even accuracy improvement in some cases. Finally, we extend CaPGNN to multi-machine multi-GPU environments. The code is available at https://github.com/songxf1024/CaPGNN.

Paper Structure

This paper contains 29 sections, 4 theorems, 16 equations, 26 figures, 9 tables, 3 algorithms.

Key Result

Lemma 1

Let the infinity norms of matrices $\mathbf{A}$ and $\mathbf{B}$ be defined as $\|\mathbf{A}\|_\infty = \max_{i,j} |\mathbf{A}_{i,j}|$ and $\|\mathbf{B}\|_\infty = \max_{i,j} |\mathbf{B}_{i,j}|$. The following inequalities hold [xue2023sugar]: (a) $\|\mathbf{A}\mathbf{B}\|_\infty \leq \text{c

Figures (26)

  • Figure 1: Message passing in a 2-layer GNN. The blue vertex is the target vertex, updated via aggregation and combination.
  • Figure 2: Vertex-centric graph partition. The original graph is partitioned into two subgraphs, with solid circles as inner vertices and dashed circles as halo vertices.
  • Figure 3: The workflow of a typical parallel GNN process. The original graph is partitioned into subgraphs using METIS. Then, subgraphs and vertex/edge features are distributed to workers. Each worker performs localized computation on its assigned subgraph. Boundary information is exchanged between workers through PCIe, or via P2P links if available.
  • Figure 4: Ratio of halo vertices to inner vertices for different numbers of partitions, hops, datasets, and partition methods.
  • Figure 5: Correlation between edge cut and total 1-hop halo vertex count across datasets and partition numbers. Here, edge cut counts unique inter-partition edges, where each bidirectional pair is counted once.
  • ...and 21 more figures
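
The quantities named in the captions of Figures 2, 4, and 5 (inner vertices, halo vertices, and edge cut) can be illustrated with a small sketch. This is not the paper's code: the function name `partition_stats`, the toy graph, and the partition assignment are all hypothetical. It assumes the common reading that a 1-hop halo vertex of a partition is a remote vertex adjacent to one of its inner vertices.

```python
from collections import defaultdict


def partition_stats(edges, part_of):
    """Return (inner, halo, edge_cut) for a vertex-centric partition.

    edges   : iterable of (u, v) pairs; both directions may be listed
    part_of : dict mapping each vertex to its partition id (hypothetical input)
    """
    inner = defaultdict(set)  # partition id -> vertices owned by that partition
    halo = defaultdict(set)   # partition id -> remote 1-hop neighbours
    for v, p in part_of.items():
        inner[p].add(v)

    cut = set()  # unique inter-partition edges, one entry per bidirectional pair
    for u, v in edges:
        pu, pv = part_of[u], part_of[v]
        if pu != pv:
            halo[pu].add(v)  # v is remote from u's partition
            halo[pv].add(u)  # u is remote from v's partition
            cut.add((min(u, v), max(u, v)))
    return inner, halo, len(cut)


# Toy graph: a 6-vertex path plus one extra edge listed in both directions.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (1, 4), (4, 1)]
part_of = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

inner, halo, edge_cut = partition_stats(edges, part_of)
halo_ratio = {p: len(halo[p]) / len(inner[p]) for p in inner}
total_halo = sum(len(h) for h in halo.values())
```

In this toy case the edge cut and the total halo count differ because one cut edge can contribute a halo vertex to each side, which is the relationship Figure 5 examines on real datasets.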

Theorems & Definitions (7)

  • Lemma 1
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Theorem 1
  • proof