Large Generative Graph Models

Yu Wang, Ryan A. Rossi, Namyong Park, Huiyuan Chen, Nesreen K. Ahmed, Puja Trivedi, Franck Dernoncourt, Danai Koutra, Tyler Derr

TL;DR

This work introduces Large Graph Generative Models (LGGMs), the first framework to pre-train graph generators on a large, multi-domain corpus (over 5000 graphs from 13 domains) to learn transferable graph priors. Leveraging discrete denoising diffusion with forward transitions and a text-conditioned objective, LGGMs achieve strong zero-shot generalization and robust fine-tuning, outperforming prior single-domain graph models such as DiGress in most settings. A key novelty is Text-to-Graph generation, where prompts describing graph domains/names or statistics guide graph synthesis through a neural conditioner, enabling fine-grained control over properties like average degree and clustering coefficient. The results demonstrate practical benefits for cross-domain graph generation and customization, and the work releases code, model checkpoints, and datasets to foster community development and downstream applications.

Abstract

Large Generative Models (LGMs) such as GPT, Stable Diffusion, Sora, and Suno are trained on huge amounts of text, images, videos, and audio drawn from numerous, highly diverse domains. This training paradigm over diverse, well-curated data lies at the heart of generating creative and sensible content. However, all previous graph generative models (e.g., GraphRNN, MDVAE, MoFlow, GDSS, and DiGress) have each been trained on only one dataset at a time, which cannot replicate the revolutionary success achieved by LGMs in other fields. To remedy this crucial gap, we propose a new class of graph generative model called Large Graph Generative Model (LGGM) that is trained on a large corpus of graphs (over 5000 graphs) from 13 different domains. We empirically demonstrate that the pre-trained LGGM has superior zero-shot generative capability to existing graph generative models. Furthermore, our pre-trained LGGM can be easily fine-tuned with graphs from target domains and demonstrates even better performance than models trained directly from scratch, serving as a solid starting point for real-world customization. Inspired by Stable Diffusion, we further equip LGGM with the capability to generate graphs given text prompts (Text-to-Graph), such as a description of the network name and domain (e.g., "The power-1138-bus graph represents a network of buses in a power distribution system.") or of network statistics (e.g., "The graph has a low average degree, suitable for modeling social media interactions."). This Text-to-Graph capability integrates the extensive world knowledge in the underlying language model, offering users fine-grained control of the generated graphs. We release the code, the model checkpoint, and the datasets at https://lggm-lg.github.io/.

Paper Structure (47 sections, 2 theorems, 13 equations, 17 figures, 15 tables)

Key Result

Theorem 1

If the transition matrices $\mathbf{Q}_{X}^{t}, \mathbf{Q}_{E}^{t}$ in Eq. \ref{eq-forward} are independent of the textual description $\mathbb{S}$, then we have $P(\mathbb{G}^{t - 1}|\mathbb{G}^{t}, \mathbb{G}, \mathbb{S}) \propto P(\mathbb{G}^t|\mathbb{G}^{t - 1})P(\mathbb{G}^{t - 1}|\mathbb{G})$ and co
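
The proportionality stated in Theorem 1 can be illustrated for a single categorical variable, such as the type of one edge. The following is a minimal sketch, not the paper's implementation: it assumes uniform transition matrices $\mathbf{Q}^{t} = (1 - \beta_t)\mathbf{I} + \beta_t \mathbf{1}\mathbf{1}^{\top}/K$ and an illustrative three-step noise schedule, and the function names (`uniform_transition`, `matmul`, `posterior`) are hypothetical. The text description $\mathbb{S}$ does not appear among the arguments of `posterior`, which mirrors the independence assumption of the theorem.

```python
def uniform_transition(beta, k):
    """Assumed form Q = (1 - beta) * I + (beta / k) * 1 1^T; each row sums to 1."""
    return [[(1.0 - beta) * (i == j) + beta / k for j in range(k)]
            for i in range(k)]


def matmul(a, b):
    """Plain matrix product, used to accumulate Q^1 Q^2 ... Q^(t-1)."""
    return [[sum(a[i][m] * b[m][j] for m in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def posterior(x0, xt, q_t, q_bar_prev):
    """P(x^{t-1} = k | x^t, x^0), proportional to
    P(x^t | x^{t-1} = k) * P(x^{t-1} = k | x^0)."""
    unnorm = [q_t[k][xt] * q_bar_prev[x0][k] for k in range(len(q_t))]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Illustrative setup (assumption): K = 2 states (no edge / edge), t = 3.
K = 2
betas = [0.1, 0.2, 0.3]
q_bar_prev = matmul(uniform_transition(betas[0], K),
                    uniform_transition(betas[1], K))  # cumulative up to t - 1
q_t = uniform_transition(betas[2], K)                 # single step t - 1 -> t

# Clean state is "edge" (1), noisy state at step t is "no edge" (0).
post = posterior(x0=1, xt=0, q_t=q_t, q_bar_prev=q_bar_prev)
print(post)
```

In a full graph model the same computation would be applied independently to every node and every edge, with separate matrices for node and edge types.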

Figures (17)

  • Figure 1: (a): Average degree and clustering coefficient of graphs from 13 domains. The graph universe consists of graphs from distinct domains (e.g., the tiny region of Chemical Graphs), yet there are some common transferable patterns. (b): Our pre-trained LGGM after fine-tuning on each domain achieves better generative performance than DiGress trained on that same domain.
  • Figure 2: Overview of the LGGM framework and experimental settings. (a): Graph universe including our collected 13 distinct yet representative domains. (b)-(c): Compared with all previous graph generative models that have been trained only on one domain each time, our LGGM is trained on thousands of graphs from 13 domains. (d): We pre-train/fine-tune LGGM in Section \ref{sec-pretrain}/\ref{sec-ft}. (e): Given the text prompt $S$ and the current generated graph at $t$, we concatenate its textual embedding obtained from a pre-trained language model with the node/edge/graph embeddings after spectral feature extraction and forward them through the Graph Transformer to predict the clean graph.
  • Figure 3: Performance comparison between Fine-tuned LGGM and Fine-tuned DiGress.
  • Figure 4: Text-to-Graph Generation with Prescribed Graph Properties. (a) Controlling Average Clustering Coefficient; (b) Controlling Average Degree. GT: ground-truth graphs; Gen: generated graphs. Below each graph, the number of nodes and key statistical measures are displayed.
  • Figure 5: With fewer training graphs, Fine-tuned LGGM becomes more advantageous than DiGress.
  • ...and 12 more figures
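
The two statistics that recur in the captions above, average degree and average clustering coefficient, can be computed directly from an edge list. The following is a minimal sketch for simple undirected graphs; the helper names and the toy graph are illustrative assumptions, and nodes with fewer than two neighbors are assigned a clustering coefficient of zero, which is one common convention and not necessarily the one used in the paper.

```python
from itertools import combinations


def build_adjacency(num_nodes, edges):
    """Undirected simple graph as a dict of neighbor sets (self-loops dropped)."""
    adj = {v: set() for v in range(num_nodes)}
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def average_degree(adj):
    """Mean number of neighbors per node, i.e. 2|E| / |V|."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with degree < 2 contribute 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count connected pairs among the neighbors of this node.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


# Toy graph (assumption): a triangle 0-1-2 with a pendant node 3 attached to 2.
toy = build_adjacency(4, [(0, 1), (1, 2), (0, 2), (2, 3)])
print(average_degree(toy))                 # 2.0
print(round(average_clustering(toy), 4))   # 0.5833
```

Values like these could be compared between ground-truth and generated graphs when checking whether a prompt such as "low average degree" was respected.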

Theorems & Definitions (4)

  • Theorem 1
  • proof
  • Theorem 2
  • proof