Leveraging Large Language Models for Effective Label-free Node Classification in Text-Attributed Graphs

Taiyan Zhang, Renchi Yang, Yurui Lai, Mingyu Yan, Xiaochun Ye, Dongrui Fan

TL;DR

This work tackles label-scarce node classification on text-attributed graphs (TAGs) by introducing Locle, a two-stage framework that combines GNNs and LLMs under a fixed LLM query budget $B$. Stage I uses an active node selection strategy driven by subspace clustering to annotate $B_{ini}=\varepsilon B$ nodes with LLMs. Stage II performs multi-round self-training in which GNNs identify informative samples via label entropy $LE$ and label disharmonicity $LH$, and a graph rewiring step refines labels with Dirichlet-energy guidance. A hybrid refinement scheme combines LLM and GNN predictions on the rewired topology, supported by a two-term objective consisting of $\text{L}_{CLS}$ and $\text{L}_{DE}$ and by theoretical links between $LH$, Dirichlet energy, and spectral clustering. Experiments on five TAG datasets show that Locle outperforms state-of-the-art label-free node classification baselines under the same query budget, with favorable cost-accuracy trade-offs, including an 8.08% accuracy gain on DBLP at a cost of less than one cent. These results point to a practical path toward scalable, label-free graph learning that couples LLM capabilities with structure-aware self-training.
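To make the Dirichlet-energy idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes the common unnormalized form, the sum of squared differences of node signals across edges; the summary does not state which normalization Locle uses. The function name `dirichlet_energy` and the toy graph are hypothetical.

```python
def dirichlet_energy(edges, signals):
    """Unnormalized Dirichlet energy (assumed form, not taken from the paper):
    sum over undirected edges (u, v) of ||signals[u] - signals[v]||^2."""
    total = 0.0
    for u, v in edges:
        total += sum((a - b) ** 2 for a, b in zip(signals[u], signals[v]))
    return total


# Hypothetical path graph 0-1-2-3 with two-class label distributions per node.
edges = [(0, 1), (1, 2), (2, 3)]
smooth = [[1.0, 0.0], [1.0, 0.0], [0.9, 0.1], [0.9, 0.1]]  # neighbors agree
rough = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]   # neighbors disagree

print(dirichlet_energy(edges, smooth))
print(dirichlet_energy(edges, rough))
```

Label assignments that vary little between connected nodes yield a low energy, which is why the quantity can serve as a smoothness signal on a graph.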

Abstract

Graph neural networks (GNNs) have become the preferred models for node classification in graph data due to their robust capabilities in integrating graph structures and attributes. However, these models heavily depend on a substantial amount of high-quality labeled data for training, which is often costly to obtain. With the rise of large language models (LLMs), a promising approach is to utilize their exceptional zero-shot capabilities and extensive knowledge for node labeling. Despite encouraging results, this approach either requires numerous queries to LLMs or suffers from reduced performance due to noisy labels generated by LLMs. To address these challenges, we introduce Locle, an active self-training framework that does Label-free node Classification with LLMs cost-Effectively. Locle iteratively identifies small sets of "critical" samples using GNNs and extracts informative pseudo-labels for them with both LLMs and GNNs, serving as additional supervision signals to enhance model training. Specifically, Locle comprises three key components: (i) an effective active node selection strategy for initial annotations; (ii) a careful sample selection scheme to identify "critical" nodes based on label disharmonicity and entropy; and (iii) a label refinement module that combines LLMs and GNNs with a rewired topology. Extensive experiments on five benchmark text-attributed graph datasets demonstrate that Locle significantly outperforms state-of-the-art methods under the same query budget to LLMs in terms of label-free node classification. Notably, on the DBLP dataset with 14.3k nodes, Locle achieves an 8.08% improvement in accuracy over the state-of-the-art at a cost of less than one cent. Our code is available at https://github.com/HKBU-LAGAS/Locle.
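As an illustration of entropy-based sample selection, here is a minimal sketch under stated assumptions. It ranks nodes by the Shannon entropy of their predicted class distributions and picks the most uncertain ones. Locle also uses label disharmonicity, whose definition is not given in this summary, so that part is omitted. The helper names `label_entropy` and `most_uncertain` and the toy predictions are hypothetical, and treating the highest-entropy nodes as the "critical" ones is an assumption.

```python
import math


def label_entropy(probs):
    """Shannon entropy (natural log) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def most_uncertain(pred, k):
    """Indices of the k nodes with the highest label entropy (assumed criterion)."""
    ranked = sorted(range(len(pred)), key=lambda i: label_entropy(pred[i]), reverse=True)
    return ranked[:k]


# Hypothetical softmax outputs of a GNN for four nodes and three classes.
pred = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # nearly uniform, very uncertain
    [0.70, 0.20, 0.10],
    [0.50, 0.49, 0.01],
]

print(most_uncertain(pred, 2))
```

In a budgeted setting, only the selected nodes would be sent to the LLM, so the choice of `k` per round directly controls query cost.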

Paper Structure

This paper contains 40 sections, 2 theorems, 20 equations, 5 figures, 14 tables.

Key Result

Lemma 1

The spectral clustering of $\boldsymbol{S}$ with $K$ desired clusters is equivalent to applying the $K$-Means over $\boldsymbol{U}$.
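The intuition behind a statement of this kind can be checked numerically. The sketch below is not the paper's proof and relies on an assumption that this summary does not confirm: that $\boldsymbol{S}=\boldsymbol{U}\boldsymbol{U}^{\top}$ for some $\boldsymbol{U}$ with $K$ orthonormal columns. Under that assumption the top-$K$ eigenvectors of $\boldsymbol{S}$ span the same subspace as $\boldsymbol{U}$, so the rows of both matrices have identical pairwise distances, and $K$-Means (which depends only on those distances) produces the same partitions. All variable names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 8, 3

# Assumed setup: U has K orthonormal columns and S = U U^T.
U, _ = np.linalg.qr(rng.standard_normal((n, K)))
S = U @ U.T

# Spectral embedding: eigenvectors of S for its K largest eigenvalues.
# np.linalg.eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(S)
V = eigvecs[:, -K:]


def pairwise_sq_dists(M):
    """Squared Euclidean distances between all pairs of rows of M."""
    G = M @ M.T
    d = np.diag(G)
    return d[:, None] + d[None, :] - 2.0 * G


# V equals U up to an orthogonal rotation, so row geometry is unchanged.
same_geometry = np.allclose(pairwise_sq_dists(U), pairwise_sq_dists(V))
print(same_geometry)
```

Because the two embeddings differ only by a rotation, any distance-based clustering run on the rows of `V` can be run on the rows of `U` instead, which avoids an eigendecomposition of the $n \times n$ matrix.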

Figures (5)

  • Figure 1: Varying #labeled nodes.
  • Figure 2: Pipeline of Our Proposed Locle
  • Figure 3: Varying $B$ in Locle.
  • Figure 4: Varying $R$ and $\varepsilon$ in Locle.
  • Figure 5: Varying parameters in Locle.

Theorems & Definitions (2)

  • Lemma 1
  • Lemma 2