FlowWalker: A Memory-efficient and High-performance GPU-based Dynamic Graph Random Walk Framework
Junyi Mei, Shixuan Sun, Chao Li, Cheng Xu, Cheng Chen, Yibo Liu, Jing Wang, Cheng Zhao, Xiaofeng Hou, Minyi Guo, Bingsheng He, Xiaoliang Cong
TL;DR
FlowWalker addresses the memory and throughput bottlenecks of dynamic graph random walks on GPUs by adopting a sampler-centric computation model and two parallel reservoir sampling schemes that operate with O(1) extra memory per task. The framework uses a two-stage warp/block sampling engine and a multi-level task pool with dynamic scheduling to balance workloads across highly skewed graphs. It supports four random walk (RW) algorithms (DeepWalk, PPR, Node2Vec, MetaPath) and achieves up to 752.2x, 72.1x, and 16.4x speedups over existing CPU, GPU, and FPGA random walk frameworks, respectively, while requiring no auxiliary data structures in GPU global memory. A ByteDance case study shows that FlowWalker cuts random walk time from 35% to 3% in a friend recommendation GNN training pipeline. Overall, FlowWalker offers a practical, scalable solution for large-scale dynamic graph sampling on GPUs, with implications for graph representation learning and online services.
Abstract
Dynamic graph random walk (DGRW) has emerged as a practical tool for capturing structural relations within a graph. Executing DGRW efficiently on GPUs presents two challenges. First, existing sampling methods require a pre-processing buffer, which incurs substantial space overhead. Second, the power-law distribution of vertex degrees causes workload imbalance, making DGRW difficult to parallelize. In this paper, we propose FlowWalker, a GPU-based dynamic graph random walk framework. FlowWalker implements an efficient parallel sampling method that fully exploits GPU parallelism and reduces space complexity. Moreover, it employs a sampler-centric paradigm alongside a dynamic scheduling strategy to handle massive numbers of walk queries. FlowWalker is memory-efficient: it requires no auxiliary data structures in GPU global memory. We evaluate FlowWalker extensively on ten datasets, and experimental results show that it achieves up to 752.2x, 72.1x, and 16.4x speedup over existing CPU, GPU, and FPGA random walk frameworks, respectively. A case study shows that FlowWalker reduces random walk time from 35% to 3% in a ByteDance friend recommendation GNN training pipeline.
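To make the memory argument concrete, the following is a minimal sequential sketch of weighted reservoir sampling, the general technique the TL;DR names. It is not FlowWalker's parallel GPU implementation. It only illustrates how one neighbor can be drawn in a single pass with constant extra state and no pre-processing buffer. The names `weighted_reservoir_pick`, `random_walk`, `adj`, and `weight_fn` are hypothetical and chosen for illustration.

```python
import random


def weighted_reservoir_pick(neighbors, weight_fn, rng=random):
    """Pick one neighbor with probability proportional to its weight.

    Illustrative sketch only: a single pass that keeps just the running
    weight total and the current pick, so no per-vertex table is built.
    """
    chosen = None
    total = 0.0
    for v in neighbors:
        w = weight_fn(v)  # weight evaluated at walk time (assumed callback)
        if w <= 0:
            continue
        total += w
        # Replace the current pick with probability w / total.
        if rng.random() < w / total:
            chosen = v
    return chosen


def random_walk(adj, start, length, weight_fn, rng=random):
    """Walk up to `length` steps over an adjacency dict (hypothetical layout)."""
    path = [start]
    cur = start
    for _ in range(length):
        nxt = weighted_reservoir_pick(
            adj.get(cur, ()), lambda v: weight_fn(cur, v), rng
        )
        if nxt is None:  # dead end: no neighbor with positive weight
            break
        path.append(nxt)
        cur = nxt
    return path
```

For example, `random_walk({0: [1, 2], 1: [0], 2: []}, 0, 5, lambda u, v: 1.0)` returns a path that starts at vertex 0 and stops early if it reaches vertex 2. Because the weight is evaluated per step, it can depend on the walker's current state without rebuilding any sampling table.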
