
When Do LLMs Help With Node Classification? A Comprehensive Analysis

Xixi Wu, Yifei Shen, Fangzhou Ge, Caihua Shan, Yizhu Jiao, Xiangguo Sun, Hong Cheng

TL;DR

This paper delivers a comprehensive, standardized analysis of LLM-based node classification by introducing LLMNodeBed, a testbed spanning multiple datasets with diverse graph types, LLM backbones, and prompts. It shows that LLM-based methods can outperform traditional methods, especially in semi-supervised settings on text-attributed graphs, while their advantage becomes marginal in fully supervised settings. It also examines how settings such as homophily affect these gains. The work dissects the roles of LLMs as encoders, explainers, and predictors, compares Direct Inference with Graph Foundation Models (GFMs) in zero-shot settings, and provides practical guidelines on model selection, prompt design, and computational trade-offs. The findings support reproducible research in graph NLP and help practitioners judge when LLMs yield meaningful gains in node classification.
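In the zero-shot Direct Inference setting, an LLM is prompted directly with a node's text and asked to name its label. The sketch below shows one way such a prompt could be assembled and the reply mapped back to a label. The template wording, the hypothetical helpers `build_zero_shot_prompt` and `parse_prediction`, and the choice to include a few neighbor texts as structural context are illustrative assumptions, not LLMNodeBed's actual prompts; the LLM call itself is omitted.

```python
from typing import Optional, Sequence


def build_zero_shot_prompt(
    node_text: str,
    neighbor_texts: Sequence[str],
    label_names: Sequence[str],
    max_neighbors: int = 3,
) -> str:
    """Assemble a zero-shot node-classification prompt (hypothetical template)."""
    lines = [f"Target node: {node_text}"]
    # Add a few neighbor texts as structural context; truncation keeps the prompt short.
    for i, text in enumerate(neighbor_texts[:max_neighbors], start=1):
        lines.append(f"Neighbor {i}: {text}")
    # Constrain the answer to the known label set so the reply can be parsed.
    lines.append("Choose exactly one category from: " + ", ".join(label_names) + ".")
    lines.append("Answer with the category name only.")
    return "\n".join(lines)


def parse_prediction(response: str, label_names: Sequence[str]) -> Optional[str]:
    """Map a free-form LLM reply to a label; return None if no label name appears."""
    reply = response.strip().lower()
    for name in label_names:
        # Case-insensitive substring match; the first listed label found wins.
        if name.lower() in reply:
            return name
    return None
```

Because `parse_prediction` matches substrings and returns the first listed label it finds, it can be ambiguous when one label name contains another; a stricter parser would require an exact match on the stripped reply.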

Abstract

Node classification is a fundamental task in graph analysis, with broad applications across various fields. Recent breakthroughs in Large Language Models (LLMs) have enabled LLM-based approaches for this task. Although many studies demonstrate the impressive performance of LLM-based methods, the lack of clear design guidelines may hinder their practical application. In this work, we aim to establish such guidelines through a fair and systematic comparison of these algorithms. As a first step, we developed LLMNodeBed, a comprehensive codebase and testbed for node classification using LLMs. It includes 10 homophilic datasets, 4 heterophilic datasets, 8 LLM-based algorithms, 8 classic baselines, and 3 learning paradigms. We then conducted extensive experiments, training and evaluating over 2,700 models, to determine the key settings (e.g., learning paradigms and homophily) and components (e.g., model size and prompt) that affect performance. Our findings yield 8 insights, e.g., (1) LLM-based methods can significantly outperform traditional methods in a semi-supervised setting, while the advantage is marginal in a supervised setting; (2) Graph Foundation Models can beat open-source LLMs but still fall short of strong LLMs like GPT-4o in a zero-shot setting. We hope that the release of LLMNodeBed, along with our insights, will facilitate reproducible research and inspire future studies in this field. Code and datasets are released at https://llmnodebed.github.io/.
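The abstract distinguishes homophilic from heterophilic datasets. A widely used way to quantify this is the edge homophily ratio: the fraction of edges whose two endpoints share a label, so values near 1 indicate a homophilic graph and values near 0 a heterophilic one. The function below is a minimal pure-Python sketch of that common definition; whether LLMNodeBed classifies its datasets with this exact measure is an assumption.

```python
from typing import Hashable, Iterable, Mapping, Tuple


def edge_homophily(
    edges: Iterable[Tuple[int, int]],
    labels: Mapping[int, Hashable],
) -> float:
    """Fraction of edges that connect two nodes with the same label."""
    total = 0
    same = 0
    for u, v in edges:
        total += 1
        # An edge counts as homophilic when both endpoints share a label.
        if labels[u] == labels[v]:
            same += 1
    if total == 0:
        raise ValueError("graph has no edges")
    return same / total


# Example: a 4-cycle where half the edges join same-label nodes.
example_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
example_labels = {0: "a", 1: "a", 2: "b", 3: "b"}
ratio = edge_homophily(example_edges, example_labels)  # 0.5
```

If an undirected graph stores each edge in both directions, every edge is counted twice in both the numerator and the denominator, so the ratio is unchanged.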


Paper Structure

This paper contains 34 sections, 3 equations, 6 figures, 21 tables.

Figures (6)

  • Figure 1: Overview of LLMNodeBed.
  • Figure 2: Illustrations of LLM-based node classification algorithms under supervised and zero-shot settings.
  • Figure 3: Performance trends across Qwen-series models of different scales using the LLaGA framework in semi-supervised settings.
  • Figure 4: Performance trends across Qwen-series models of different scales using the LLaGA framework in supervised settings.
  • Figure 5: Biased predictions by LLM-as-Predictor methods on the Instagram dataset: Comparison of ground-truth label distributions with predictor-generated label distributions.
  • ...and 1 more figure
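Figure 5 compares the ground-truth label distribution with the distribution of labels produced by LLM-as-Predictor methods. One simple way to put a number on such bias is the total variation distance between the two empirical distributions, which is 0 when they match and 1 when they share no labels. The sketch below uses that metric as an illustrative choice; it is not necessarily what the paper reports, and the helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def label_distribution(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Empirical frequency of each label."""
    if not labels:
        raise ValueError("empty label sequence")
    counts = Counter(labels)
    n = len(labels)
    return {lab: c / n for lab, c in counts.items()}


def total_variation(
    true_labels: Sequence[Hashable],
    pred_labels: Sequence[Hashable],
) -> float:
    """Half the L1 distance between true and predicted label distributions."""
    p = label_distribution(true_labels)
    q = label_distribution(pred_labels)
    # Labels missing from one distribution contribute their full mass.
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)


# Example: a balanced two-class ground truth versus a predictor that always says "A".
truth = ["A", "A", "B", "B"]
preds = ["A", "A", "A", "A"]
bias = total_variation(truth, preds)  # 0.5
```

Total variation only captures how far the predicted label frequencies drift from the true ones; it says nothing about per-node accuracy, so it complements rather than replaces standard accuracy metrics.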