Towards Affordable, Adaptive and Automatic GNN Training on CPU-GPU Heterogeneous Platforms

Tong Qiao, Ao Zhou, Yingjie Qi, Yiou Wang, Han Wan, Jianlei Yang, Chunming Hu

TL;DR

This paper tackles the high cost and inflexibility of GNN training on CPU-GPU heterogeneous platforms by introducing A3GNN, which combines locality-aware sampling, adaptive multi-level parallelism scheduling, and task-hardware oriented auto-tuning driven by reinforcement learning. A surrogate performance model is learned offline to predict throughput, memory footprint, and accuracy, enabling rapid design-space exploration under hardware constraints. The approach yields significant speedups (up to 3.95X in some settings) and scalable performance across diverse datasets, while controlling memory usage and accuracy loss. The work demonstrates how hardware-aware, automated optimization can make large-scale GNN training more accessible to researchers and practitioners on commodity hardware.

Abstract

Graph Neural Networks (GNNs) have been widely adopted due to their strong performance. However, GNN training often relies on expensive, high-performance computing platforms, limiting accessibility for many tasks. Profiling of representative GNN workloads indicates that substantial efficiency gains are possible on resource-constrained devices by fully exploiting available resources. This paper introduces A3GNN, a framework for affordable, adaptive, and automatic GNN training on heterogeneous CPU-GPU platforms. It improves resource usage through locality-aware sampling and fine-grained parallelism scheduling. Moreover, it leverages reinforcement learning to explore the design space and achieve Pareto-optimal trade-offs among throughput, memory footprint, and accuracy. Experiments show that A3GNN can bridge the performance gap, allowing seven NVIDIA 2080Ti GPUs to outperform two A100 GPUs by up to 1.8X in throughput with minimal accuracy loss.
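
To make the notion of a Pareto-optimal trade-off among throughput, memory footprint, and accuracy concrete, here is a minimal sketch of a Pareto filter over candidate training configurations. It is an illustration only, not the A3GNN auto-tuner: the `Config` type, the configuration names, and all numbers are hypothetical, and the paper's reinforcement-learning search and surrogate model are not reproduced here.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    """Hypothetical candidate configuration with three measured objectives."""
    name: str
    throughput: float  # higher is better
    memory_gb: float   # lower is better
    accuracy: float    # higher is better


def dominates(a: Config, b: Config) -> bool:
    """Return True if a is no worse than b on every objective and strictly better on one."""
    no_worse = (
        a.throughput >= b.throughput
        and a.memory_gb <= b.memory_gb
        and a.accuracy >= b.accuracy
    )
    strictly_better = (
        a.throughput > b.throughput
        or a.memory_gb < b.memory_gb
        or a.accuracy > b.accuracy
    )
    return no_worse and strictly_better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep only the configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Made-up numbers, purely for illustration.
candidates = [
    Config("a", 100.0, 8.0, 0.70),
    Config("b", 120.0, 10.0, 0.69),
    Config("c", 90.0, 9.0, 0.68),   # dominated by "a" on all three objectives
    Config("d", 120.0, 6.0, 0.65),
]
front = pareto_front(candidates)
```

In this toy example, configuration `c` is dropped because `a` is at least as good on every objective, while `a`, `b`, and `d` each win on at least one objective and therefore remain on the front.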

Paper Structure

This paper contains 13 sections, 5 equations, 8 figures, 3 tables, 3 algorithms.

Figures (8)

  • Figure 1: Training performance profiling of PyG and Quiver.
  • Figure 2: Profiling GNNs training tasks on different datasets and platforms.
  • Figure 3: Framework overview of $\textrm{A}^{3}$GNN.
  • Figure 4: Parallelism scheduling templates of $\textrm{A}^{3}$GNN.
  • Figure 5: Task-hardware oriented auto-tuning.
  • ...and 3 more figures