Rethinking Graph Structure Learning in the Era of LLMs

Zhihan Zhang, Xunkai Li, Zhu Lei, Guang Zeng, Ronghua Li, Guoren Wang

TL;DR

This work rethinks graph structure learning (GSL) for text-attributed graphs (TAGs) in the era of large language models (LLMs) by introducing LLaTA, a training-free, decoupled framework that reframes GSL as a language-guided tree optimization task. It builds a topology-aware structural encoding tree via structural entropy minimization, then uses tree-prompted LLM in-context inference with a Community of Thought mechanism to reason jointly over topology and node text. A leaf-oriented two-step sampling procedure then refines the graph without any training, achieving state-of-the-art performance across 11 TAG datasets while avoiding costly fine-tuning. The method is robust and efficient, scales to large graphs, and offers a practical paradigm for integrating LLMs with GSL in real-world TAG applications.

Abstract

Recently, the emergence of LLMs has prompted researchers to integrate language descriptions into graphs, aiming to enhance model encoding capabilities from a data-centric perspective. Such graphs are called text-attributed graphs (TAGs). A review of prior advancements highlights that graph structure learning (GSL) is a pivotal technique for improving data utility, making it highly relevant to efficient TAG learning. However, most GSL methods are tailored for traditional graphs without textual information, underscoring the need for a new GSL paradigm. Despite this clear motivation, two challenges remain: (1) How can we define a reasonable optimization objective for GSL in the era of LLMs, given the massive number of parameters in LLMs? (2) How can we design an efficient model architecture that enables seamless integration of LLMs for this optimization objective? For Question 1, we reformulate existing GSL optimization objectives as a tree optimization framework, shifting the focus from obtaining a well-trained edge predictor to a language-aware tree sampler. For Question 2, we propose decoupled and training-free model design principles for LLM integration, shifting the focus from computation-intensive fine-tuning to more efficient inference. Based on this, we propose Large Language and Tree Assistant (LLaTA), which leverages tree-based LLM in-context learning to enhance the understanding of topology and text, enabling reliable inference and generating an improved graph structure. Extensive experiments on 11 datasets demonstrate that LLaTA offers flexibility (it can be incorporated with any backbone), scalability (it outperforms other LLM-enhanced graph learning methods), and effectiveness (it achieves SOTA predictive performance).

Paper Structure

This paper contains 47 sections, 13 theorems, 41 equations, 10 figures, 8 tables, 3 algorithms.

Key Result

Theorem 1

Given an encoding tree $\mathcal{T}$ and a non-leaf node $\phi \in \mathcal{T}$, the error of topological information $\varepsilon^h(\phi)$ in community $\mathcal{C}_\phi$ is upper bounded by: $\frac{g_\phi}{2m} \log_2 \frac{\operatorname{vol}(\phi^{+})}{g_\phi}$, and $\varepsilon^h(\phi)$ gradually
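The bound quoted in Theorem 1 can be evaluated numerically on a small graph. The following is a minimal sketch, not the authors' implementation. It assumes the usual structural-entropy conventions, which this excerpt does not spell out: $g_\phi$ is the number of edges with exactly one endpoint inside the community $\mathcal{C}_\phi$, $\operatorname{vol}(\phi^{+})$ is the degree sum of the parent node's community, and $m$ is the total number of edges. The function names and the toy graph are hypothetical.

```python
import math
from typing import Iterable, List, Set, Tuple

Edge = Tuple[int, int]


def volume(edges: Iterable[Edge], nodes: Set[int]) -> int:
    """Degree sum of `nodes` in an undirected graph (assumed meaning of vol)."""
    return sum((u in nodes) + (v in nodes) for u, v in edges)


def cut_size(edges: Iterable[Edge], nodes: Set[int]) -> int:
    """Edges with exactly one endpoint in `nodes` (assumed meaning of g_phi)."""
    return sum((u in nodes) != (v in nodes) for u, v in edges)


def theorem1_bound(edges: List[Edge], community: Set[int], parent: Set[int]) -> float:
    """Evaluate g / (2m) * log2(vol(parent) / g) for one community."""
    m = len(edges)
    g = cut_size(edges, community)
    if g == 0:
        # The expression tends to 0 as g -> 0, so use that limit directly.
        return 0.0
    return g / (2 * m) * math.log2(volume(edges, parent) / g)


# Hypothetical toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
all_nodes = {0, 1, 2, 3, 4, 5}

# Community {0, 1, 2} under a root that covers the whole graph.
bound = theorem1_bound(edges, {0, 1, 2}, all_nodes)
print(f"Theorem 1 bound for community {{0, 1, 2}}: {bound:.4f}")
```

In this toy case only the bridge edge crosses the community boundary, so the expression stays small. Under the stated assumptions, a community with many outgoing edges would yield a larger value.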

Figures (10)

  • Figure 1: The overview of our proposed tree-based GSL optimization pipeline and empirical results.
  • Figure 2: (Left) The overview of LLaTA; (Right) The detailed pipeline of LLaTA, which includes: topology-aware tree prompts, reliable LLM inference and language-aware tree sampler.
  • Figure 3: Node classification performance on the Cora and Pubmed datasets under real-world scenarios (Sparsity [Edge Removal] and Noise [Edge Addition]).
  • Figure 4: Hyperparameter analysis of $K$, $\theta$ and $r$. $\theta$ and $r$ are analyzed on the History dataset.
  • Figure 5: Comparison of training and inference time for LLM-based GSL methods.
  • ...and 5 more figures

Theorems & Definitions (25)

  • Theorem 1: Topological Information Capturing Properties of Encoding Tree
  • Theorem 2: Implicit Global Constraints in Low-Level Communities
  • Theorem 3: Error Bound Between Soft labels and True Labels
  • Theorem 4: High-Entropy Nodes Require Supervision
  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Theorem 1: Topological Information Capturing Properties of Encoding Tree
  • ...and 15 more