HopGNN: Boosting Distributed GNN Training Efficiency via Feature-Centric Model Migration

Weijian Chen; Shuibing He; Haoyang Qu; Xuechen Zhang

TL;DR

This work proposes LeapGNN, a feature-centric framework that reverses the prevalent model-centric paradigm by bringing GNN models to vertex features instead of transferring features to models. It also employs a micrograph-based merging method that adjusts the number of micrographs for each worker to minimize kernel switches and synchronization overhead.

Abstract

Distributed training of graph neural networks (GNNs) has become a crucial technique for processing large graphs. Prevalent GNN frameworks are model-centric, necessitating the transfer of massive graph vertex features to GNN models, which leads to a significant communication bottleneck. Recognizing that the model size is often significantly smaller than the feature size, we propose LeapGNN, a feature-centric framework that reverses this paradigm by bringing GNN models to vertex features. To make it truly effective, we first propose a micrograph-based training strategy that trains the model using a refined structure with superior locality to reduce remote feature retrieval. Then, we devise a feature pre-gathering approach that merges multiple fetch operations into a single one to eliminate redundant feature transmissions. Finally, we employ a micrograph-based merging method that adjusts the number of micrographs for each worker to minimize kernel switches and synchronization overhead. Our experimental results demonstrate that LeapGNN achieves a performance speedup of up to 4.2x compared to the state-of-the-art method, namely P3.
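
The feature pre-gathering idea described above, merging multiple fetch operations into a single one so that no feature is transmitted twice, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (naive_fetch_count, pre_gather) and the vertex IDs are hypothetical, and the sketch only counts transfers rather than performing any communication.

```python
from typing import Iterable, List, Set


def naive_fetch_count(requests: Iterable[Iterable[int]]) -> int:
    """Features transferred if each micrograph issues its own fetch.

    A vertex needed by several micrographs is transferred once per micrograph.
    """
    return sum(len(set(r)) for r in requests)


def pre_gather(requests: Iterable[Iterable[int]]) -> List[int]:
    """Merge all per-micrograph requests into one deduplicated ID list.

    The result stands in for a single batched fetch (hypothetical helper).
    """
    merged: Set[int] = set()
    for r in requests:
        merged.update(r)
    return sorted(merged)


# Hypothetical remote vertex IDs needed by three micrographs on one worker.
micrograph_requests = [[3, 7, 9], [7, 9, 12], [3, 12, 15]]

batched_ids = pre_gather(micrograph_requests)
naive = naive_fetch_count(micrograph_requests)
print(f"separate fetches move {naive} features; one merged fetch moves {len(batched_ids)}")
```

In this toy input, vertices 3, 7, 9, and 12 are each requested by two micrographs, so merging the requests removes four redundant transfers and replaces three fetch operations with one.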

Paper Structure (24 sections, 23 figures, 3 tables)

Introduction
Background
Motivation and Challenges
Communication Bottleneck in GNN Training
A Naive Feature-Centric Training Approach
New Abstraction: Micrograph
Design of HopGNN
Micrograph-Based GNN Training
Vertex Feature Pre-Gathering
Micrograph Merging in GNN Training
Implementation
Evaluation
Experimental Setup
Overall Performance
Impact of Individual Techniques
...and 9 more sections

Figures (23)

Figure 1: A GNN training example.
Figure 2: An example of partitioned graph topology and features on two GPU servers.
Figure 3: An example of model-centric distributed GNN training approach.
Figure 4: Training time breakdown. 'Model-x' means the fanout is x. 'SAGE' denotes GraphSAGE.
Figure 5: The $\alpha$ value of different models. 'Model (x)' denotes that the number of model layers is x.
...and 18 more figures
