Reducing Memory Contention and I/O Congestion for Disk-based GNN Training

Qisheng Jiang; Lei Jia; Chundong Wang

TL;DR

This work targets disk-based GNN training on commodity hardware, where memory capacity and I/O bandwidth are the bottlenecks. It introduces GNNDrive, a system that minimizes the memory footprint of feature extraction through inter-stage buffering and avoids I/O congestion with asynchronous feature extraction and direct I/O, so that computation overlaps with data transfer. The design supports mini-batch reordering and multi-GPU data parallelism, avoids costly pre-processing, and runs concurrent pipelines. When training GraphSAGE on Papers100M, GNNDrive is 16.9x, 2.6x, and 2.7x faster than PyG+, Ginex, and MariusGNN, respectively, and the approach scales to MAG240M while maintaining convergence properties. Overall, GNNDrive offers a practical, high-performance option for disk-based GNN training on ordinary machines, broadening accessibility for researchers and small-to-medium enterprises.

Abstract

Graph neural networks (GNNs) have gained wide popularity. Large graphs with high-dimensional features have become common, and training GNNs on them is non-trivial on an ordinary machine. Given a gigantic graph, even sample-based GNN training cannot work efficiently, since it is difficult to keep the graph's entire data in memory during the training process. Leveraging a solid-state drive (SSD) or other storage devices to extend the memory space has been studied in training GNNs. Memory and I/Os are hence critical for effective disk-based training. We find that state-of-the-art (SoTA) disk-based GNN training systems suffer severely from two issues: memory contention between a graph's topological and feature data, and I/O congestion upon loading data from SSD for training. We accordingly develop GNNDrive. GNNDrive 1) minimizes the memory footprint with holistic buffer management across sampling and feature extraction, and 2) avoids I/O congestion through a strategy of asynchronous feature extraction. It also avoids costly data preparation on the critical path and makes the most of software and hardware resources. Experiments show that GNNDrive achieves superior performance. For example, when training with the Papers100M dataset and GraphSAGE model, GNNDrive is faster than SoTA PyG+, Ginex, and MariusGNN by 16.9x, 2.6x, and 2.7x, respectively.
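As a rough illustration of the general idea behind asynchronous feature extraction with a bounded buffer, the sketch below overlaps feature loading with training using a producer thread and a fixed-capacity queue. It is not GNNDrive's implementation: `run_pipeline`, `load_features`, `train_step`, `buffer_slots`, and the in-memory `feature_table` standing in for on-disk features are all hypothetical names chosen for this example, and real disk reads, direct I/O, and GPU transfers are omitted.

```python
import queue
import threading


def run_pipeline(batches, load_features, train_step, buffer_slots=2):
    """Overlap feature loading with training through a bounded buffer."""
    # The fixed capacity caps how many extracted batches sit in memory at once.
    buf = queue.Queue(maxsize=buffer_slots)
    done = object()  # sentinel marking the end of the batch stream

    def extractor():
        try:
            for node_ids in batches:
                # put() blocks while the buffer is full, throttling the loader.
                buf.put(load_features(node_ids))
        finally:
            # Always signal completion so the consumer never waits forever.
            buf.put(done)

    worker = threading.Thread(target=extractor, daemon=True)
    worker.start()

    results = []
    while True:
        item = buf.get()
        if item is done:
            break
        # Training on this batch proceeds while the extractor loads the next.
        results.append(train_step(item))
    worker.join()
    return results


# Toy stand-ins (assumptions): a dict plays the role of on-disk node features.
feature_table = {i: [float(i), float(i) * 2.0] for i in range(8)}


def load_features(node_ids):
    return [feature_table[i] for i in node_ids]


def train_step(features):
    # Placeholder for a real training step: sums all feature values.
    return sum(sum(row) for row in features)


batches = [[0, 1], [2, 3], [4, 5, 6]]
losses = run_pipeline(batches, load_features, train_step)
print(losses)  # [3.0, 15.0, 45.0]
```

The bounded queue is the design choice worth noting: a larger `buffer_slots` lets the loader run further ahead of training at the cost of more memory, while a smaller value keeps the footprint low but may leave the trainer waiting on I/O.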
