NeutronTP: Load-Balanced Distributed Full-Graph GNN Training with Tensor Parallelism

Xin Ai, Hao Yuan, Zeyu Ling, Qiange Wang, Yanfeng Zhang, Zhenbo Fu, Chaoyi Chen, Yu Gu, Ge Yu

TL;DR

NeutronTP tackles the inefficiencies of distributed full-graph GNN training by replacing graph-partition data parallelism with tensor parallelism that partitions vertex features across workers, thereby eliminating cross-worker vertex dependencies and achieving full load balance. It introduces a generalized decoupled training framework that separates NN operations from graph aggregation to reduce communication, and a memory-efficient chunk-based task scheduling strategy with inter-chunk pipelining that trains graphs larger than GPU memory while overlapping computation and communication. The approach yields substantial speedups over state-of-the-art full-graph and mini-batch systems across multiple datasets and graph types while maintaining accuracy, and it demonstrates good scalability and higher GPU utilization. This makes large-scale, heterogeneous-graph GNN training more practical on multi-node GPU clusters, enabling faster experimentation and deployment in real-world graph analytics tasks.

Abstract

Graph neural networks (GNNs) have emerged as a promising direction. Training GNNs on large-scale graphs relies on distributed computing power and poses new challenges. Existing distributed GNN systems leverage data parallelism by partitioning the input graph and distributing it to multiple workers. However, due to the irregular nature of the graph structure, existing distributed approaches suffer from unbalanced workloads and high overhead in managing cross-worker vertex dependencies. In this paper, we leverage tensor parallelism for distributed GNN training. GNN tensor parallelism eliminates cross-worker vertex dependencies by partitioning features instead of graph structures. Different workers are assigned training tasks on different feature slices of the same dimension, leading to complete load balance. We achieve efficient GNN tensor parallelism through two key techniques. First, we employ a generalized decoupled training framework to decouple NN operations from graph aggregation operations, significantly reducing the communication overhead caused by NN operations, which must be computed using complete features. Second, we employ a memory-efficient task scheduling strategy to support the training of large graphs that exceed the memory of a single GPU, while further improving performance by overlapping communication and computation. By integrating the above techniques, we propose NeutronTP, a distributed GNN training system. Our experimental results on a 16-node Aliyun cluster demonstrate that NeutronTP achieves a 1.29X-8.72X speedup over state-of-the-art GNN systems including DistDGL, NeutronStar, and Sancus.
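
The core idea of partitioning features rather than graph structure can be illustrated with a small sketch. The code below is not NeutronTP's implementation; it is a toy, single-process Python example under assumed names (`aggregate`, `split_columns`, `tensor_parallel_aggregate` are all hypothetical) that uses plain sum aggregation. It shows that when each simulated worker aggregates over the whole graph on its own equal-width feature slice, the concatenated result matches full-feature aggregation, and every worker performs the same amount of work even on a skewed graph.

```python
# Toy illustration only: all names and data here are hypothetical,
# and sum aggregation stands in for whatever aggregator a real GNN uses.

def aggregate(adj, feats):
    """Sum the feature vectors of each vertex's in-neighbors."""
    dim = len(feats[0])
    out = []
    for neighbors in adj:
        acc = [0.0] * dim
        for u in neighbors:
            for d in range(dim):
                acc[d] += feats[u][d]
        out.append(acc)
    return out

def split_columns(feats, num_workers):
    """Split the feature matrix column-wise into equal-width slices."""
    dim = len(feats[0])
    assert dim % num_workers == 0, "sketch assumes an evenly divisible dimension"
    width = dim // num_workers
    return [
        [row[w * width:(w + 1) * width] for row in feats]
        for w in range(num_workers)
    ]

def tensor_parallel_aggregate(adj, feats, num_workers):
    """Each simulated worker aggregates the full graph on its own slice."""
    slices = split_columns(feats, num_workers)
    partials = [aggregate(adj, s) for s in slices]  # no cross-worker dependency
    # Concatenate the per-worker slices back into full-width rows.
    return [sum((p[v] for p in partials), []) for v in range(len(adj))]

# A skewed graph: vertex 0 has three in-neighbors, the others have one.
adj = [[1, 2, 3], [0], [0], [0]]
feats = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]

full = aggregate(adj, feats)
sliced = tensor_parallel_aggregate(adj, feats, num_workers=2)
assert full == sliced

# Per-worker work is (number of edges) x (slice width), identical for all workers.
num_edges = sum(len(n) for n in adj)
work_per_worker = [num_edges * len(s[0]) for s in split_columns(feats, 2)]
```

In this sketch the degree skew of vertex 0 affects every worker equally, which is the load-balance property the abstract describes. What it leaves out is the communication needed to gather complete features for NN operations, which is the cost the decoupled training framework targets.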

Paper Structure (30 sections, 9 equations, 16 figures, 4 tables, 1 algorithm)

Figures (16)

  • Figure 1: Illustration of a single-layer computation process in a GNN model, including graph aggregation operations and neural network (NN) operations.
  • Figure 2: GNN data parallelism vs. GNN tensor parallelism. The thickness of the arrows and the size of the circles are proportional to the feature/embedding dimension and indicate the computation volume of GNN training.
  • Figure 3: GNN training workload of 4 partitions under different partitioning methods. (2-layer GCN on Reddit)
  • Figure 4: The vertex dependency (VD) management overhead of DistDGL and NeutronStar.
  • Figure 5: The number of vertex dependencies of DistDGL and NeutronStar.
  • ...and 11 more figures