MixGCN: Scalable GCN Training by Mixture of Parallelism and Mixture of Accelerators

Cheng Wan; Runkai Tao; Zheng Du; Yang Katie Zhao; Yingyan Celine Lin

TL;DR

MixGCN targets scalable full-graph GCN training by combining Mixture of Parallelism (MoP) with Mixture of Accelerators (MoA). MoP splits features across aggregation units, which avoids duplicating remote-neighbor features and keeps communication volume and memory usage constant. MoA assigns sparse and dense operations to dedicated accelerators, introduces S-SpMM for fused sparse computation, and adds a node-reordering pipeline to reduce idle time. Evaluated on five large datasets, the framework reports end-to-end throughput gains of up to 10.4x over strong baselines and up to 17.2x in sparse-accelerator simulations, along with memory and energy savings. These results position MixGCN as a scalable, architecture-aware approach to large-scale GCN training, with practical implications for real-world graph analytics and learning tasks.
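To make the remote-neighbor duplication concrete, the following minimal sketch counts how many neighbor feature rows a node-partitioned scheme has to fetch from other partitions. It is an illustration under simplifying assumptions, not the paper's cost model: the function name `remote_feature_fetches`, the edge-list representation, and the `part_of` mapping are all hypothetical.

```python
def remote_feature_fetches(edges, part_of):
    """Count distinct (partition, remote node) pairs under partition parallelism.

    edges   : iterable of undirected edges (u, v)
    part_of : dict mapping each node to its partition id (hypothetical layout)

    Each pair stands for one feature row that a partition must receive
    because one of its local nodes has a neighbor stored elsewhere.
    """
    needed = set()
    for u, v in edges:
        if part_of[u] != part_of[v]:
            needed.add((part_of[u], v))  # u's partition needs v's features
            needed.add((part_of[v], u))  # v's partition needs u's features
    return len(needed)


# A 4-node cycle: 0-1-2-3-0
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
two_parts = {0: 0, 1: 0, 2: 1, 3: 1}
four_parts = {0: 0, 1: 1, 2: 2, 3: 3}
```

On this toy cycle the count is 4 with two partitions and 8 with four, which mirrors the scaled-out communication that the summary attributes to partition parallelism. Splitting by feature dimension instead leaves every node's neighbors local to each aggregation unit, so this particular fetch does not arise.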

Abstract

Graph convolutional networks (GCNs) have demonstrated superiority in graph-based learning tasks. However, training GCNs on full graphs is difficult for two reasons: (1) the associated feature tensors can easily exhaust the memory and saturate the communication bandwidth of modern accelerators, and (2) the computation workflow of GCN training alternates between sparse and dense matrix operations, which complicates efficient utilization of computational resources. Existing solutions for scalable distributed full-graph GCN training mostly adopt partition parallelism, which is unsatisfactory because it only partially addresses the first challenge while incurring scaled-out communication volume. To this end, we propose MixGCN, which aims to address both challenges simultaneously. To tackle the first challenge, MixGCN integrates a mixture of parallelism; both theoretical and empirical analyses verify its constant communication volume and better-balanced workload. To handle the second challenge, we consider a mixture of accelerators (i.e., sparse and dense accelerators) with a dedicated accelerator for GCN training and a fine-grained pipeline. Extensive experiments show that MixGCN improves training efficiency and scalability.
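The alternation between sparse and dense operations, and the reason a feature-dimension split needs no neighbor features from other units, can be sketched in a few lines of plain Python. This is a toy under stated assumptions rather than the MixGCN implementation: aggregation is an unnormalized neighbor sum over an adjacency list, and the helper names (`aggregate`, `update`, `split_columns`, `concat_columns`, `aggregate_feature_split`) are hypothetical.

```python
def aggregate(adj, feats):
    """Sparse step: for every node, sum the feature rows of its neighbors."""
    dim = len(feats[0])
    out = []
    for nbrs in adj:
        row = [0.0] * dim
        for j in nbrs:
            for k in range(dim):
                row[k] += feats[j][k]
        out.append(row)
    return out


def update(feats, weight):
    """Dense step: multiply the feature matrix by a dense weight matrix."""
    return [[sum(row[k] * weight[k][c] for k in range(len(weight)))
             for c in range(len(weight[0]))] for row in feats]


def split_columns(feats, parts):
    """Cut the feature matrix into at most `parts` column chunks."""
    dim = len(feats[0])
    step = (dim + parts - 1) // parts
    return [[row[s:s + step] for row in feats] for s in range(0, dim, step)]


def concat_columns(chunks):
    """Glue column chunks back together row by row."""
    return [sum((chunk[i] for chunk in chunks), [])
            for i in range(len(chunks[0]))]


def aggregate_feature_split(adj, feats, parts):
    """Aggregate each column chunk independently, as separate units would."""
    return concat_columns([aggregate(adj, c)
                           for c in split_columns(feats, parts)])


adj = [[1, 2], [0], [0, 1]]  # neighbors of nodes 0, 1, 2
feats = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 10.0, 11.0, 12.0]]
```

Because aggregation acts on each feature column independently, `aggregate_feature_split(adj, feats, 2)` returns the same matrix as `aggregate(adj, feats)`. Each unit sees the whole graph but only a slice of the feature dimension, so the sparse step needs no remote rows. The dense `update` step needs full rows, which is where a redistribution between the two kinds of units would come in.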

Paper Structure (27 sections, 5 theorems, 3 equations, 14 figures, 5 tables, 2 algorithms)

Introduction
Background and Related Work
Graph Convolutional Networks
Partition Parallelism for GCN Training
Tensor Parallel Computing
GCN Accelerators
The Proposed Framework
Mixture of Parallelism (MoP)
Partition Parallelism
The Proposed Mixture of Parallelism (MoP)
Scalability of All-to-All Communication
Mixture of Accelerators (MoA)
An Accelerator for Operator Fusion
A Pipeline Scheduler with Node Reordering
Experiments
...and 12 more sections

Key Result

Proposition 3.1

Balancing the computation workload of GCN training with partition parallelism is NP-Hard.

Figures (14)

Figure 1: An illustrative comparison between partition parallelism and the proposed MixGCN, where MixGCN avoids the scaled-out communication volume needed for duplicated remote neighbor features (highlighted in red in (b)) as required by partition parallelism.
Figure 2: Illustrating the workflow of our proposed mixture of parallelism (MoP) where we adopt 3 pairs of aggregation and update accelerators for visual clarity.
Figure 3: An illustration of S-SpMM in the accelerator for neighbor aggregation.
Figure 4: An illustration of the proposed mixture of accelerators (MoA), which integrates a dedicated accelerator for computing S-SpMM (Sampled Sparse Matrix-Matrix Multiplication).
Figure 5: An example comparing the temporal execution flow of different pipeline designs between the sparse and dense accelerators. We assume that the training graph is identical to the graph in Figure 1(a).
...and 9 more figures

Theorems & Definitions (5)

Proposition 3.1
Proposition 3.2
Proposition 3.3
Proposition 3.4
Proposition 3.5
