FlowWalker: A Memory-efficient and High-performance GPU-based Dynamic Graph Random Walk Framework

Junyi Mei, Shixuan Sun, Chao Li, Cheng Xu, Cheng Chen, Yibo Liu, Jing Wang, Cheng Zhao, Xiaofeng Hou, Minyi Guo, Bingsheng He, Xiaoliang Cong

TL;DR

FlowWalker addresses the memory and throughput bottlenecks of dynamic graph random walks on GPUs by adopting a sampler-centric computation model and two parallel reservoir sampling schemes that operate with O(1) extra memory per task. The framework deploys a two-stage warp/block sampling engine and a multi-level task pool with dynamic scheduling to balance workloads across highly skewed graphs. It supports four random walk (RW) algorithms (DeepWalk, PPR, Node2Vec, MetaPath) and reports speedups of up to 752.2x, 72.1x, and 16.4x over existing CPU, GPU, and FPGA random walk frameworks, respectively, while requiring no auxiliary data structures in GPU global memory. In a ByteDance case study, it reduces the share of random walk time in a friend recommendation GNN training pipeline from 35% to 3%. FlowWalker thus targets large-scale dynamic graph sampling on GPUs for graph representation learning and online services.

Abstract

Dynamic graph random walk (DGRW) has emerged as a practical tool for capturing structural relations within a graph. Executing DGRW efficiently on GPUs presents two challenges. First, existing sampling methods demand a pre-processing buffer, which incurs substantial space complexity. Second, the power-law distribution of vertex degrees introduces workload imbalance, making DGRW difficult to parallelize. In this paper, we propose FlowWalker, a GPU-based dynamic graph random walk framework. FlowWalker implements an efficient parallel sampling method that fully exploits GPU parallelism and reduces space complexity. Moreover, it employs a sampler-centric paradigm alongside a dynamic scheduling strategy to handle large numbers of walking queries. FlowWalker is a memory-efficient framework that requires no auxiliary data structures in GPU global memory. We evaluate FlowWalker extensively on ten datasets, and experimental results show that it achieves up to 752.2x, 72.1x, and 16.4x speedup compared with existing CPU, GPU, and FPGA random walk frameworks, respectively. A case study shows that FlowWalker reduces random walk time from 35% to 3% in a ByteDance friend recommendation GNN training pipeline.

Paper Structure

This paper contains 30 sections, 2 theorems, 3 equations, 16 figures, 5 tables, 4 algorithms.

Key Result

Proposition 1

Given a sequence $S$ of vertices and the corresponding weight sequence $W$, the direct parallel reservoir sampling algorithm (DPRS) picks vertex $v$ with probability $\frac{w_v}{\sum_{w \in W} w}$, where $w_v$ is the weight of $v$.
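
To make the probability in Proposition 1 concrete, the following Python sketch shows the standard sequential weighted reservoir rule (keep a running weight sum $s$ and replace the current candidate with probability $w/s$) together with a lane-wise merge that simulates parallel lanes on the CPU. It is an illustration under stated assumptions, not the paper's GPU kernel: the function names, the strided split across lanes, and the merge step are hypothetical choices made for this example. The sequential rule keeps item $i$ with probability $\frac{w_i}{s_i}\prod_{j>i}\frac{s_{j-1}}{s_j} = \frac{w_i}{s_n}$, where $s_j$ is the sum of the first $j$ weights, which matches the probability stated in the proposition.

```python
import random
from collections import Counter


def weighted_reservoir_pick(vertices, weights, rng):
    """Pick one vertex with probability proportional to its weight.

    Only the current candidate and the running weight sum are stored,
    so the extra state is O(1) regardless of the sequence length.
    """
    chosen = None
    running_sum = 0.0
    for v, w in zip(vertices, weights):
        if w <= 0:
            continue  # zero-weight items can never be selected
        running_sum += w
        # Replace the candidate with probability w / running_sum.
        if rng.random() * running_sum < w:
            chosen = v
    return chosen


def chunked_reservoir_pick(vertices, weights, num_lanes, rng):
    """Simulate parallel lanes (hypothetical layout, not the paper's kernel).

    Each lane scans a strided slice and keeps one candidate plus its
    partial weight sum; the candidates are then merged by treating each
    lane's partial sum as the weight of its candidate.
    """
    candidates, partial_sums = [], []
    for lane in range(num_lanes):
        lane_vertices = vertices[lane::num_lanes]
        lane_weights = weights[lane::num_lanes]
        candidates.append(weighted_reservoir_pick(lane_vertices, lane_weights, rng))
        partial_sums.append(sum(w for w in lane_weights if w > 0))
    return weighted_reservoir_pick(candidates, partial_sums, rng)


def empirical_distribution(pick, trials):
    """Estimate selection frequencies by calling pick() repeatedly."""
    counts = Counter(pick() for _ in range(trials))
    return {v: c / trials for v, c in counts.items()}
```

Calling `empirical_distribution(lambda: chunked_reservoir_pick(vs, ws, 3, rng), 20000)` with weights `[1, 2, 3, 4]` should give frequencies close to 0.1, 0.2, 0.3, and 0.4, because a lane with partial sum $s_\ell$ wins the merge with probability $s_\ell / \sum W$ and its candidate is $v$ with probability $w_v / s_\ell$.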

Figures (16)

  • Figure 1: The procedure for sampling a neighbor of $v_0$.
  • Figure 2: System Design Overview of FlowWalker. The execution flow is organized as follows: ① Thread blocks fetch tasks from the global task pool into their local task pools. ② Tasks are dispatched to the appropriate sampler based on the vertex degree. ③ Warp and block samplers execute the sampling tasks. This step requires graph data stored in global memory and random number generators (RNG) stored in shared memory. ④ The query states in the local task pool are updated and the sampling results are recorded.
  • Figure 3: Comparison of DPRS and ZPRS when sampling a neighbor of $v_0$ in the example graph using three threads. DPRS scans $W$ once, but the number of collective operations depends on the number of iterations. ZPRS performs only two collective operations, but scans $W$ twice. Logically, DPRS scans along the sequence $S$, whereas ZPRS scans $S$ in a zig-zag order.
  • Figure 4: Computation in a thread block. A query is not evicted from a thread block until its stop conditions are met. Tasks are processed in two stages. First, warp samplers process tasks in which the degree of the vertex where the query currently resides is no greater than $d_t$. Then, the block sampler processes the remaining tasks. After sampling, if a query meets the stop conditions, a new query is fetched from the global task pool ($P_G$) and added to the local task pool ($P_L$) (Step ⑤.1). Otherwise, the query state in $P_L$ is updated (Step ⑤.2).
  • Figure 5: Queries are grouped into batches that execute alternately in two CUDA streams. $h$ refers to the head pointer of the global task pool. Thread blocks fetch tasks in a preemptive way if they have empty slots.
  • ...and 11 more figures
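
The execution flow described in the captions of Figures 2 and 4 can be sketched as a CPU-side simulation of a single thread block. This is a minimal sketch under stated assumptions, not FlowWalker's implementation: the function `run_thread_block`, its parameters, the fixed walk length used as the stop condition, and the helper `make_uniform_sampler` are all hypothetical, and the "warp" and "block" labels only record which stage would handle a task rather than performing any parallel sampling.

```python
import random


def make_uniform_sampler(adj, seed=0):
    """Hypothetical stand-in for a sampler: pick a neighbor uniformly."""
    rng = random.Random(seed)
    return lambda v: rng.choice(adj[v])


def run_thread_block(queries, degree_of, sample_next, walk_length, d_t, slots):
    """Simulate one thread block with a local pool of `slots` queries.

    queries: list of (query_id, start_vertex) acting as the global pool P_G.
    Returns the walk paths and a log of (query_id, vertex, sampler) steps.
    """
    head = 0                     # h: head pointer into the global task pool
    local_pool = [None] * slots  # P_L: queries resident in this block
    paths, log = {}, []
    while True:
        # Fetch new queries into empty slots while the global pool has work.
        for i in range(slots):
            if local_pool[i] is None and head < len(queries):
                qid, start = queries[head]
                head += 1
                local_pool[i] = (qid, start)
                paths[qid] = [start]
        active = [i for i in range(slots) if local_pool[i] is not None]
        if not active:
            break
        # Dispatch by degree: low-degree tasks first, the rest afterwards.
        warp_stage = [i for i in active if degree_of(local_pool[i][1]) <= d_t]
        block_stage = [i for i in active if degree_of(local_pool[i][1]) > d_t]
        for sampler, stage in (("warp", warp_stage), ("block", block_stage)):
            for i in stage:
                qid, v = local_pool[i]
                if degree_of(v) == 0:
                    local_pool[i] = None  # dead end: the query stops here
                    continue
                nxt = sample_next(v)
                log.append((qid, v, sampler))
                paths[qid].append(nxt)
                if len(paths[qid]) - 1 >= walk_length:
                    local_pool[i] = None          # stop condition met; slot is free
                else:
                    local_pool[i] = (qid, nxt)    # update the query state in place
    return paths, log
```

A query stays in its slot until it finishes, which mirrors the caption's statement that queries are not evicted before their stop conditions are met; a freed slot is refilled on the next iteration. On a GPU, multiple blocks would share the head pointer, so advancing it would need to be synchronized, which this single-block sketch does not model.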

Theorems & Definitions (2)

  • Proposition 1
  • Proposition 2