HopGNN: Boosting Distributed GNN Training Efficiency via Feature-Centric Model Migration

Weijian Chen, Shuibing He, Haoyang Qu, Xuechen Zhang

TL;DR

This work proposes LeapGNN, a feature-centric framework that reverses the prevalent model-centric paradigm by bringing GNN models to vertex features, and employs a micrograph-based merging method that adjusts the number of micrographs for each worker to minimize kernel switches and synchronization overhead.

Abstract

Distributed training of graph neural networks (GNNs) has become a crucial technique for processing large graphs. Prevalent GNN frameworks are model-centric, necessitating the transfer of massive graph vertex features to GNN models, which leads to a significant communication bottleneck. Recognizing that the model size is often significantly smaller than the feature size, we propose LeapGNN, a feature-centric framework that reverses this paradigm by bringing GNN models to vertex features. To make it truly effective, we first propose a micrograph-based training strategy that trains the model using a refined structure with superior locality to reduce remote feature retrieval. Then, we devise a feature pre-gathering approach that merges multiple fetch operations into a single one to eliminate redundant feature transmissions. Finally, we employ a micrograph-based merging method that adjusts the number of micrographs for each worker to minimize kernel switches and synchronization overhead. Our experimental results demonstrate that LeapGNN achieves a performance speedup of up to 4.2x compared to the state-of-the-art method, namely P3.
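
The trade-off described in the abstract can be illustrated with a rough back-of-the-envelope sketch. It is not the authors' implementation: the helper names (`feature_bytes`, `model_bytes`, `pre_gather`), the dense-layer parameter count, and all numbers below are assumptions chosen for illustration only. The first two functions compare the bytes moved when remote features are shipped to the model against the bytes moved when a small model is shipped to the features; the third shows the idea behind pre-gathering, where several fetch requests are merged into one deduplicated request.

```python
def feature_bytes(num_remote_vertices, feature_dim, bytes_per_value=4):
    """Bytes transferred when remote vertex features are pulled to the model."""
    return num_remote_vertices * feature_dim * bytes_per_value


def model_bytes(layer_dims, bytes_per_value=4):
    """Bytes transferred when the model is moved instead.

    Assumes one dense weight matrix plus a bias vector per layer,
    which is a simplification of real GNN layers.
    """
    pairs = zip(layer_dims, layer_dims[1:])
    return sum((d_in * d_out + d_out) * bytes_per_value for d_in, d_out in pairs)


def pre_gather(requests):
    """Merge several fetch requests into one, dropping repeated vertex IDs.

    `requests` is a list of vertex-ID lists (hypothetical input format).
    First-seen order is preserved.
    """
    seen = set()
    merged = []
    for request in requests:
        for vertex in request:
            if vertex not in seen:
                seen.add(vertex)
                merged.append(vertex)
    return merged


# Illustrative numbers only, not taken from the paper.
pulled = feature_bytes(num_remote_vertices=10_000, feature_dim=512)
moved = model_bytes(layer_dims=[512, 256, 16])

requests = [[3, 5, 7], [5, 7, 9], [3, 9, 11]]
merged = pre_gather(requests)
```

With these made-up sizes, moving the model transfers far fewer bytes than pulling the features, and the merged request fetches each vertex once instead of repeating the overlapping IDs. A real system would also have to account for gradients, optimizer state, and synchronization, which this sketch ignores.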

Paper Structure (24 sections, 23 figures, 3 tables)

Figures (23)

  • Figure 1: A GNN training example.
  • Figure 2: An example of partitioned graph topology and features on two GPU servers.
  • Figure 3: An example of model-centric distributed GNN training approach.
  • Figure 4: Training time breakdown. 'Model-x' means the fanout is x. 'SAGE' denotes GraphSAGE.
  • Figure 5: The $\alpha$ value of different models. 'Model (x)' denotes that the number of model layers is x.
  • ...and 18 more figures