GraphMaster: Automated Graph Synthesis via LLM Agents in Data-Limited Environments

Enjun Du, Xunkai Li, Tian Jin, Zhihan Zhang, Rong-Hua Li, Guoren Wang

TL;DR

GraphMaster tackles the data scarcity bottleneck in Graph Foundation Models (GFMs) by enabling semantically rich text-attributed graph (TAG) synthesis in data-limited environments. It introduces a hierarchical Retrieval-Augmented Generation framework in which four specialized LLM-powered agents (Manager, Perception, Enhancement, Evaluation) iteratively extract knowledge, generate content, and assess quality to ensure semantic coherence and structural validity. The work contributes a standardized data-limited benchmark suite for TAG synthesis, a dual-perspective interpretability framework that combines human evaluation with Grassmannian analysis, and extensive experiments showing state-of-the-art performance across multiple datasets and GNN architectures. In practice, GraphMaster provides a regulated, interpretable pathway for producing high-quality synthetic TAG data to train GFMs when data is scarce, with potential applications in knowledge graphs, scientific discovery, and recommendation systems.

Abstract

The era of foundation models has revolutionized AI research, yet Graph Foundation Models (GFMs) remain constrained by the scarcity of large-scale graph corpora. Traditional graph data synthesis techniques primarily focus on simplistic structural operations, lacking the capacity to generate semantically rich nodes with meaningful textual attributes: a critical limitation for real-world applications. While large language models (LLMs) demonstrate exceptional text generation capabilities, their direct application to graph synthesis is impeded by context window limitations, hallucination phenomena, and structural consistency challenges. To address these issues, we introduce GraphMaster, the first multi-agent framework specifically designed for graph data synthesis in data-limited environments. GraphMaster orchestrates four specialized LLM agents (Manager, Perception, Enhancement, and Evaluation) that collaboratively optimize the synthesis process through iterative refinement, ensuring both semantic coherence and structural integrity. To rigorously evaluate our approach, we create new data-limited "Sub" variants of six standard graph benchmarks, specifically designed to test synthesis capabilities under realistic constraints. Additionally, we develop a novel interpretability assessment framework that combines human evaluation with a principled Grassmannian manifold-based analysis, providing both qualitative and quantitative measures of semantic coherence. Experimental results demonstrate that GraphMaster significantly outperforms traditional synthesis methods across multiple datasets, establishing a strong foundation for advancing GFMs in data-scarce environments.

Paper Structure

This paper contains 53 sections, 11 theorems, 85 equations, 9 figures, 11 tables, and 1 algorithm.

Key Result

Theorem 1

Given a set of semantically related unit-normalized text embeddings $\{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_K\} \subset \mathbb{S}^{d-1}$, there exists an optimal direction $\mathbf{u}^* \in \mathbb{S}^{d-1}$ that minimizes the sum of squared geodesic distances on the Grassmann manifold:
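
The displayed objective is not reproduced in this listing. As a rough numerical illustration only, the sketch below assumes the geodesic distance on Gr(1, d) between the lines spanned by $\mathbf{u}$ and $\mathbf{x}_i$ is $\arccos(|\mathbf{u}^\top \mathbf{x}_i|)$, and evaluates the sum of squared distances for a candidate direction. The candidate is the principal eigenvector of $\sum_i \mathbf{x}_i \mathbf{x}_i^\top$, found by power iteration. This is a swapped-in heuristic (it minimizes the chordal rather than the geodesic objective) and is not claimed to be the paper's solution. The function names and the toy embeddings are hypothetical.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def geodesic_cost(u, xs):
    """Sum of squared geodesic distances on Gr(1, d).

    Assumption: d(u, x) = arccos(|u . x|) for unit vectors u and x.
    """
    total = 0.0
    for x in xs:
        c = abs(sum(a * b for a, b in zip(u, x)))
        c = min(1.0, c)  # guard against rounding slightly above 1
        total += math.acos(c) ** 2
    return total

def principal_direction(xs, iters=200):
    """Power iteration on S = sum_i x_i x_i^T.

    Assumption: used here as a candidate direction only; it maximizes
    sum_i (u . x_i)^2, which is close to the geodesic minimizer when
    the embeddings are tightly clustered.
    """
    d = len(xs[0])
    u = normalize([1.0] * d)
    for _ in range(iters):
        v = [0.0] * d
        for x in xs:
            w = sum(a * b for a, b in zip(x, u))  # (x_i . u)
            for j in range(d):
                v[j] += w * x[j]                  # accumulate S u
        u = normalize(v)
    return u

# Toy unit-normalized "embeddings" clustered around the first axis.
xs = [normalize(v) for v in ([1.0, 0.1, 0.0], [0.9, -0.1, 0.1], [1.0, 0.0, -0.1])]
u_star = principal_direction(xs)
baseline = [0.0, 1.0, 0.0]  # an arbitrary direction far from the cluster

print("cost at candidate:", geodesic_cost(u_star, xs))
print("cost at baseline: ", geodesic_cost(baseline, xs))
```

On this toy input the candidate direction yields a much lower cost than the arbitrary baseline, which is the qualitative behavior a principal semantic direction is meant to capture.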

Figures (9)

  • Figure 1: GraphMaster: A hierarchical multi-agent framework for text-attributed graph synthesis.
  • Figure 2: Graph feature analysis on Children dataset.
  • Figure 3: Interpretability Analysis of GraphMaster.
  • Figure 4: Graph feature analysis on Cora dataset. The top three rows show the original data-limited dataset; the bottom three rows show the results after TAG synthesis with GraphMaster.
  • Figure 5: Graph feature analysis on Citeseer dataset. The top three rows show the original data-limited dataset; the bottom three rows show the results after TAG synthesis with GraphMaster.
  • ...and 4 more figures

Theorems & Definitions (27)

  • Definition 1: Grassmann Manifold
  • Theorem 1: Principal Semantic Direction
  • Proposition 1: Semantic Coherence Metric
  • Proof of Theorem 1 (Principal Semantic Direction)
  • Proof of Proposition 1 (Semantic Coherence Metric)
  • Theorem 2: Computational Solution
  • proof
  • Definition 2: Topological Information Density
  • Theorem 3: Information Capture Properties of the Perception Agent
  • proof
  • ...and 17 more