ARGO: An Auto-Tuning Runtime System for Scalable GNN Training on Multi-Core Processor

Yi-Chien Lin, Yuyang Chen, Sameh Gobriel, Nilesh Jain, Gopi Krishna Jha, Viktor Prasanna

TL;DR

ARGO tackles the difficulty of scaling GNN training on multi-core CPUs by combining a Multi-Process Engine with a core-binding mechanism and a lightweight online auto-tuner based on Bayesian Optimization. The system runs multiple training processes in parallel to overlap computation and communication, while the auto-tuner learns platform- and model-specific configurations on the fly with minimal overhead. It preserves the original training semantics by adjusting mini-batch sizes and using DDP for gradient synchronization. ARGO achieves speedups of up to 5.06x on a four-socket Ice Lake machine with 112 cores and up to 4.54x on a two-socket Sapphire Rapids machine with 64 cores. The work demonstrates robust scalability improvements and seamless integration with popular GNN libraries, offering a practical path to faster CPU-based GNN training in real-world deployments, while highlighting future NUMA-bandwidth considerations for further gains.
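
The core-binding and mini-batch adjustment described above can be illustrated with a minimal Python sketch. It is not ARGO's implementation: the helper names (`partition_cores`, `per_process_batch_size`, `bind_current_process`) are hypothetical, and it assumes the batch adjustment is an even split of the global batch across processes. Pinning uses the standard `os.sched_setaffinity` call, which is only available on some platforms (for example Linux), so the helper checks for it first.

```python
import os


def partition_cores(cores, num_procs):
    """Split core IDs into num_procs disjoint, contiguous, near-equal groups."""
    if num_procs < 1 or num_procs > len(cores):
        raise ValueError("num_procs must be between 1 and len(cores)")
    base, extra = divmod(len(cores), num_procs)
    groups, start = [], 0
    for rank in range(num_procs):
        # The first `extra` groups take one additional core each.
        size = base + (1 if rank < extra else 0)
        groups.append(cores[start:start + size])
        start += size
    return groups


def per_process_batch_size(global_batch_size, num_procs):
    """Shrink each process's mini-batch so the combined batch per step is unchanged.

    Assumption: an even split. The actual adjustment rule may differ.
    """
    if global_batch_size % num_procs != 0:
        raise ValueError("global batch size should divide evenly across processes")
    return global_batch_size // num_procs


def bind_current_process(core_ids):
    """Pin the calling process to core_ids if the OS exposes sched_setaffinity."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, set(core_ids))
        return True
    return False


# Hypothetical machine: 8 cores shared by 4 training processes.
plan = partition_cores(list(range(8)), num_procs=4)
local_batch = per_process_batch_size(1024, num_procs=4)
```

In a real multi-process run, each worker would call `bind_current_process(plan[rank])` before it starts training, and the gradients computed on the smaller local batches would be averaged across processes (for instance by DDP) so that each step still reflects the full global batch.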

Abstract

As Graph Neural Networks (GNNs) have grown in popularity, libraries such as PyTorch-Geometric (PyG) and Deep Graph Library (DGL) have emerged as the de facto standard for implementing GNNs because they provide graph-oriented APIs and are purposefully designed to manage the inherent sparsity and irregularity in graph structures. However, these libraries scale poorly on multi-core processors, which leaves the available platform resources under-utilized and limits performance. This is because GNN training is a resource-intensive workload with a high volume of irregular data accesses, and existing libraries fail to utilize the memory bandwidth efficiently. To address this challenge, we propose ARGO, a novel runtime system for GNN training that offers scalable performance. ARGO exploits multi-processing and core-binding techniques to improve platform resource utilization. We further develop an auto-tuner that searches for the optimal configuration for multi-processing and core-binding. The auto-tuner works automatically, making it completely transparent to the user. Furthermore, the auto-tuner allows ARGO to adapt to various platforms, GNN models, datasets, etc. We evaluate ARGO on two representative GNN models and four widely-used datasets on two platforms. With the proposed auto-tuner, ARGO is able to select a near-optimal configuration by exploring only 5% of the design space. ARGO speeds up state-of-the-art GNN libraries by up to 5.06x and 4.54x on a four-socket Ice Lake machine with 112 cores and a two-socket Sapphire Rapids machine with 64 cores, respectively. Finally, ARGO can seamlessly integrate into widely-used GNN libraries (e.g., DGL, PyG) with a few lines of code and speed up GNN training.
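
To make the idea of exploring only 5% of the design space concrete, the following Python sketch runs a budget-limited search over a small configuration space. Several things here are assumptions for illustration only: the configuration triple `(num_procs, sampling_cores, training_cores)`, the synthetic cost function `toy_epoch_time`, and the 16-core machine are all hypothetical, and plain random search stands in for the Bayesian Optimization that the auto-tuner uses. Random search will not match a Bayesian optimizer's sample efficiency; the sketch only shows how a fixed evaluation budget bounds the tuning cost.

```python
import math
import random


def design_space(total_cores, proc_options):
    """Enumerate hypothetical (num_procs, sampling_cores, training_cores) triples."""
    space = []
    for n in proc_options:
        per_proc = total_cores // n
        # Each process splits its cores between sampling and training.
        for s in range(1, per_proc):
            space.append((n, s, per_proc - s))
    return space


def toy_epoch_time(cfg):
    """Synthetic cost model used only for this sketch. Not a measurement."""
    n, s, t = cfg
    return 100.0 / (n * t) + 5.0 / s + 2.0 * n


def tune(space, measure, budget_fraction=0.05, seed=0):
    """Evaluate a random subset of the space and return the best configuration seen."""
    budget = max(1, math.ceil(budget_fraction * len(space)))
    rng = random.Random(seed)
    trials = [(measure(cfg), cfg) for cfg in rng.sample(space, budget)]
    best_cost, best_cfg = min(trials)
    return best_cfg, best_cost, len(trials)


space = design_space(total_cores=16, proc_options=(1, 2, 4, 8))
best_cfg, best_cost, num_trials = tune(space, toy_epoch_time)
```

In an online setting, `measure` would time a real training epoch under the candidate configuration instead of calling a synthetic function, so every trial also contributes to training progress.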

Paper Structure

This paper contains 33 sections, 3 equations, 12 figures, 6 tables, 1 algorithm.

Figures (12)

  • Figure 1: State-of-the-art GNN libraries suffer from poor scalability
  • Figure 2: Time-trace of (A) running a single GNN training and (B) running two GNN training programs in parallel
  • Figure 3: System overview of ARGO
  • Figure 4: Task Coordination in ARGO
  • Figure 5: Reducing the batch size increases the workload
  • ...and 7 more figures