GLBench: A Comprehensive Benchmark for Graph with Large Language Models

Yuhan Li, Peisong Wang, Xiao Zhu, Aochuan Chen, Haiyun Jiang, Deng Cai, Victor Wai Kin Chan, Jia Li

TL;DR

GLBench is the first comprehensive benchmark for evaluating GraphLLM methods in both supervised and zero-shot scenarios. It reveals that both structures and semantics are crucial for effective zero-shot transfer, and that the proposed simple baseline can outperform several models tailored for zero-shot scenarios.

Abstract

The emergence of large language models (LLMs) has revolutionized the way we interact with graphs, leading to a new paradigm called GraphLLM. Despite the rapid development of GraphLLM methods in recent years, the progress and understanding of this field remain unclear due to the lack of a benchmark with consistent experimental protocols. To bridge this gap, we introduce GLBench, the first comprehensive benchmark for evaluating GraphLLM methods in both supervised and zero-shot scenarios. GLBench provides a fair and thorough evaluation of different categories of GraphLLM methods, along with traditional baselines such as graph neural networks. Through extensive experiments on a collection of real-world datasets with consistent data processing and splitting strategies, we have uncovered several key findings. Firstly, GraphLLM methods outperform traditional baselines in supervised settings, with LLM-as-enhancers showing the most robust performance. However, using LLMs as predictors is less effective and often leads to uncontrollable output issues. We also notice that no clear scaling laws exist for current GraphLLM methods. In addition, both structures and semantics are crucial for effective zero-shot transfer, and our proposed simple baseline can even outperform several models tailored for zero-shot scenarios. The data and code of the benchmark can be found at https://github.com/NineAbyss/GLBench.

Paper Structure (28 sections, 4 figures, 10 tables)

Figures (4)

  • Figure 1: Timeline of GraphLLM research. Existing methods can be divided into three categories based on the role played by LLM. Top left corner illustrates the key differences of roles.
  • Figure 2: Comparison of the supervised performance among all methods. The color of the box plot represents the average score for each metric, while the central line within the box indicates the median score. We exclude results falling below 50% as they significantly deviate from the data center.
  • Figure 3: Training time and space analysis on Cora, Citeseer, WikiCS, and Instagram.
  • Figure 4: Model scaling behaviors of OFA and GLEM with varying model depths. An $R^2$ close to 1 indicates that the fitted model has strong explanatory power, while an $R^2$ close to 0 means that it explains almost none of the relationship between the dependent and independent variables.
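
To make the $R^2$ reading in the Figure 4 caption concrete, here is a minimal sketch of how a coefficient of determination could be computed for a straight-line fit of a score against model depth. The function name `r_squared` and all numbers are hypothetical and chosen purely for illustration; they are not results from the paper, and the paper's own fitting procedure may differ.

```python
def r_squared(xs, ys):
    """Coefficient of determination for a least-squares line fit of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form ordinary least-squares slope and intercept.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical data for illustration only (assumed, not taken from the paper).
depths = [1, 2, 3, 4, 5]
steady = [70.0, 72.0, 74.0, 76.0, 78.0]   # score rises linearly with depth
erratic = [74.0, 70.0, 76.0, 70.0, 74.0]  # score shows no consistent trend

print(r_squared(depths, steady))   # close to 1: depth explains the score well
print(r_squared(depths, erratic))  # close to 0: depth explains almost nothing
```

Under this reading, a method with a clear scaling trend would produce a value near 1, while the absence of a clear scaling law would show up as a value near 0.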