Accelerating Sampling and Aggregation Operations in GNN Frameworks with GPU Initiated Direct Storage Accesses

Jeongmin Brian Park; Vikram Sharma Mailthody; Zaid Qureshi; Wen-mei Hwu

TL;DR

The paper tackles the bottleneck of training GNNs on graphs that exceed CPU memory by shifting data preparation fully onto the GPU using the GPU Initiated Direct Storage Access (GIDS) dataloader. It combines the BaM storage-access framework with a dynamic storage access accumulator, a constant CPU buffer, and window buffering to hide storage latency and maximize bandwidth. The approach achieves dramatic end-to-end speedups (up to 582× in some setups) over state-of-the-art baselines on terabyte-scale datasets, including heterogeneous graphs, using a single machine. This work demonstrates that GPU-centric data preparation, coupled with smart cache and data-placement strategies, can unlock scalable, high-throughput GNN training on large graphs without resorting to multi-node deployments. The practical impact is substantial for researchers and practitioners needing efficient single-node training on graphs far larger than CPU memory.

Abstract

Graph Neural Networks (GNNs) are emerging as a powerful tool for learning from graph-structured data and performing sophisticated inference tasks in various application domains. Although GNNs have been shown to be effective on modest-sized graphs, training them on large-scale graphs remains a significant challenge due to the lack of efficient data access and data movement methods. Existing frameworks for training GNNs use CPUs for graph sampling and feature aggregation, while the training and updating of model weights are executed on GPUs. However, our in-depth profiling shows that the CPUs cannot achieve the throughput required to saturate GNN model training throughput, causing gross under-utilization of expensive GPU resources. Furthermore, when the graph and its embeddings do not fit in the CPU memory, the overhead introduced by the operating system, for example for handling page faults, falls on the critical path of execution. To address these issues, we propose the GPU Initiated Direct Storage Access (GIDS) dataloader, which enables GPU-oriented GNN training for large-scale graphs while efficiently utilizing all hardware resources, such as CPU memory, storage, and GPU memory, with a hybrid data placement strategy. By enabling GPU threads to fetch feature vectors directly from storage, the GIDS dataloader solves the memory capacity problem for GPU-oriented GNN training. Moreover, the GIDS dataloader leverages GPU parallelism to tolerate storage latency and eliminates expensive page-fault overhead. Doing so enables us to design novel optimizations for exploiting locality and increasing effective bandwidth for GNN training. Our evaluation using a single GPU on terabyte-scale GNN datasets shows that the GIDS dataloader accelerates the overall DGL GNN training pipeline by up to 392× when compared to the current, state-of-the-art DGL dataloader.
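The two data-preparation steps named in the abstract, graph sampling and feature aggregation, can be pictured with a small sketch. This is an illustrative toy in plain Python, not the GIDS implementation and not the DGL API: the function names `sample_neighbors` and `prepare_minibatch`, the toy graph, and the dictionary-based feature store are all hypothetical. In GIDS the feature lookup would be issued by GPU threads against storage; here a dictionary lookup stands in for that access.

```python
import random


def sample_neighbors(adj, seeds, fanout, rng):
    """Uniformly sample up to `fanout` neighbors for each seed node."""
    sampled = set()
    for node in seeds:
        neighbors = adj.get(node, [])
        k = min(fanout, len(neighbors))
        sampled.update(rng.sample(neighbors, k))
    return sampled


def prepare_minibatch(adj, features, seeds, fanouts, seed=0):
    """Multi-layer node sampling followed by feature gathering (hypothetical helper)."""
    rng = random.Random(seed)
    frontier = set(seeds)
    nodes = set(seeds)
    for fanout in fanouts:
        # Expand one layer outward from the current frontier.
        frontier = sample_neighbors(adj, sorted(frontier), fanout, rng)
        nodes |= frontier
    node_ids = sorted(nodes)
    # Feature aggregation: gather one feature vector per sampled node.
    # This lookup is the access that dominates I/O on large graphs.
    batch_features = [features[n] for n in node_ids]
    return node_ids, batch_features


# Toy graph and features (made-up data for illustration only).
adj = {0: [1, 2, 3], 1: [0, 4], 2: [0, 5], 3: [0], 4: [1], 5: [2]}
features = {n: [float(n), float(n) * 2.0] for n in adj}
node_ids, batch = prepare_minibatch(adj, features, seeds=[0], fanouts=[2, 2])
```

The sketch makes one point visible: the set of feature vectors to fetch is only known after sampling finishes, so the feature reads are data-dependent and scattered across the feature table.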

Paper Structure (25 sections, 3 equations, 15 figures, 4 tables)

Introduction
Background
Graph Neural Networks (GNNs)
GNN Training Pipeline
Mini-batching
Node Sampling
Node Feature Aggregation
Limitation of Existing GNN Frameworks
The BaM System
System Design
GIDS Dataloader System Overview
Dynamic Storage Access Accumulator
Constant CPU Buffer
Window Buffering
Graph Structure Data in CPU Memory
...and 10 more sections

Figures (15)

Figure 1: Illustration of the GNN training process with the GIDS dataloader or the BaM dataloader.
Figure 2: A subgraph generated by a uniformly random selection method for two-layer Neighborhood Sampling.
Figure 3: Request generation rate of data preparation on CPU and GPU, and request consumption rate on GPU, on the IGB-small dataset. The CPU and GPU used in this measurement are listed in the paper's configuration table.
Figure 4: Illustration of the GNN training process with the memory-mapping DGL dataloader.
Figure 5: GNN training time breakdown for the baseline DGL dataloader for different graph datasets. The node feature data is accessed from memory-mapped files, while the graph structure data is stored in the CPU memory. The GraphSAGE model is used as the GNN training model. The graph properties are listed in the paper's dataset table.
...and 10 more figures
