Random Walk Diffusion for Efficient Large-Scale Graph Generation

Tobias Bernecker, Ghalia Rehawi, Francesco Paolo Casale, Janine Knauer-Arloth, Annalisa Marsico

TL;DR

ARROW-Diff tackles the challenge of generating large-scale graphs with realistic topology by introducing a discrete diffusion process on random walks (OA-ARDM) coupled with a GNN-based edge validator. The method iteratively generates edge proposals from random walks and refines them through degree-guided sampling, enabling scalable generation up to tens of thousands of nodes. Empirical results on five citation graphs and a synthetic SBM show improved topology metrics (e.g., triangles, degree distribution) and substantially faster generation times compared to baselines. The runtime analysis indicates ARROW-Diff achieves favorable complexity $O\left(L\,(N\,D + |E|)\right)$, highlighting its practicality for large-scale graph synthesis.

Abstract

Graph generation addresses the problem of generating new graphs that have a data distribution similar to real-world graphs. While previous diffusion-based graph generation methods have shown promising results, they often struggle to scale to large graphs. In this work, we propose ARROW-Diff (AutoRegressive RandOm Walk Diffusion), a novel random walk-based diffusion approach for efficient large-scale graph generation. Our method encompasses two components in an iterative process of random walk sampling and graph pruning. We demonstrate that ARROW-Diff can scale to large graphs efficiently, surpassing other baseline methods in terms of both generation time and multiple graph statistics, reflecting the high quality of the generated graphs.

Paper Structure

This paper contains 28 sections, 3 equations, 7 figures, 5 tables, 2 algorithms.

Figures (7)

  • Figure 1: Overview of ARROW-Diff graph generation (inference) using a trained OA-ARDM and a trained GNN. Starting from an empty graph, a diffusion model samples random walks from a set of start nodes. A GNN then classifies the proposed edges and filters out invalid ones. This procedure is repeated $L$ times, each time using a different set of start nodes whose sampling is guided by the change in node degrees with respect to the original graph.
  • Figure 2: The change in different graph evaluation metrics with respect to $L$. The dotted line reports the value corresponding to the ground truth graph, in this case CiteSeer. For each metric, the mean and standard deviation were computed across 10 generated graphs using $L \in [1, 30]$ iterations for ARROW-Diff.
  • Figure 3: The mean and standard deviation of the number of edges of 10 graphs generated by ARROW-Diff, using the original node features, are reported for $L \in [1, 30]$. The dotted line represents the number of edges of the original CiteSeer graph.
  • Figure 4: Visualization of the training graphs and generated graphs for Cora-ML and CiteSeer from NetGAN, Graphite, EDGE, BiGG, and ARROW-Diff using the trained models from Table \ref{tab:single-graph-results}. We observe that ARROW-Diff is able to capture the basic structure of the original graph.
  • Figure 5: Principal component analysis (PCA) of the Graph2Vec embeddings of the original CiteSeer, DBLP, and PubMed graphs and 100 generated graphs for each of the datasets, showing a clear separation between the generated graphs of the different datasets, and a proximity of the generated graphs to their real graph counterparts.
  • ...and 2 more figures
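
The inference loop described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `sample_walks` is a hypothetical stand-in for the trained OA-ARDM (it draws uniform random node sequences), `validate` is a hypothetical stand-in for the trained GNN (it keeps each edge with a fixed probability), and the start-node rule in `pick_start_nodes` (sampling each node with probability proportional to its remaining degree deficit) is an assumption about how "guided by the change of node degrees" might be realized.

```python
import random

def sample_walks(start_nodes, walk_len, n_nodes, rng):
    # Hypothetical placeholder for the trained diffusion model:
    # each walk is a uniform random node sequence from its start node.
    return [[s] + [rng.randrange(n_nodes) for _ in range(walk_len - 1)]
            for s in start_nodes]

def walks_to_edges(walks):
    # Consecutive nodes in a walk become undirected edge proposals.
    edges = set()
    for walk in walks:
        for u, v in zip(walk, walk[1:]):
            if u != v:  # skip self-loops
                edges.add((min(u, v), max(u, v)))
    return edges

def validate(edges, rng, keep_prob=0.5):
    # Hypothetical placeholder for the trained GNN edge classifier.
    return {e for e in sorted(edges) if rng.random() < keep_prob}

def degrees(edges, n_nodes):
    deg = [0] * n_nodes
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def pick_start_nodes(target_deg, current_deg, rng):
    # Assumed rule: nodes still missing edges relative to the original
    # graph are more likely to start a walk in the next iteration.
    deficit = [max(0, t - c) for t, c in zip(target_deg, current_deg)]
    top = max(deficit)
    if top == 0:
        return []
    return [n for n, d in enumerate(deficit) if rng.random() < d / top]

def generate(target_deg, num_iters, walk_len, seed=0):
    rng = random.Random(seed)
    n_nodes = len(target_deg)
    edges = set()                       # start from an empty graph
    start_nodes = list(range(n_nodes))  # first round: every node starts a walk
    for _ in range(num_iters):          # num_iters plays the role of L
        if not start_nodes:
            break
        proposals = walks_to_edges(
            sample_walks(start_nodes, walk_len, n_nodes, rng))
        edges |= validate(proposals, rng)  # keep only accepted proposals
        start_nodes = pick_start_nodes(
            target_deg, degrees(edges, n_nodes), rng)
    return edges
```

For example, `generate([2] * 10, num_iters=5, walk_len=4)` returns a set of undirected edges over 10 nodes. Swapping the two placeholders for trained models would leave the control flow unchanged, which is the point of the sketch: each iteration does work proportional to the walks sampled and the edges checked.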