Evaluating Large Language Models on Graphs: Performance Insights and Comparative Analysis

Chang Liu; Bo Wu

TL;DR

The paper systematically evaluates how large language models handle graph-structured data by converting graphs into natural-language representations and testing across connectivity, neighborhood, degree, pattern matching, and all-shortest-path tasks. It analyzes four prompting regimes and four evaluation metrics to assess comprehension, correctness, fidelity, and rectification. Findings show GPT models outperform open-source rivals in comprehension and correctness, yet all models struggle with complex structural reasoning and multi-answer fidelity, with prompting sometimes hindering rather than helping. The work highlights the potential of LLMs for graph analytics while underscoring the need for calibration and tools to ensure reliable, self-correcting behavior in graph-related reasoning.

Abstract

Large Language Models (LLMs) have garnered considerable interest within both academia and industry. Yet, the application of LLMs to graph data remains under-explored. In this study, we evaluate the capabilities of four LLMs in addressing several analytical problems with graph data. We employ four distinct evaluation metrics: Comprehension, Correctness, Fidelity, and Rectification. Our results show that: 1) LLMs effectively comprehend graph data in natural language and reason with graph topology. 2) GPT models can generate logical and coherent results, outperforming alternatives in correctness. 3) All examined LLMs face challenges in structural reasoning, with techniques like zero-shot chain-of-thought and few-shot prompting showing diminished efficacy. 4) GPT models often produce erroneous answers in multi-answer tasks, raising concerns in fidelity. 5) GPT models exhibit elevated confidence in their outputs, potentially hindering their rectification capacities. Notably, GPT-4 has demonstrated the capacity to rectify responses from GPT-3.5-turbo and its own previous iterations. The code is available at: https://github.com/Ayame1006/LLMtoGraph.
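
As a rough illustration of the setup described above, the following Python sketch renders a small undirected graph as natural-language text and computes reference answers for three of the listed tasks (connectivity, degree, all shortest paths). The sentence wording, the function names, and the toy graph are assumptions made for this example; they are not taken from the paper or its repository.

```python
from collections import deque


def describe_graph(num_nodes, edges):
    """Render an undirected graph as English sentences (hypothetical prompt wording)."""
    lines = [
        f"This is an undirected graph with {num_nodes} nodes numbered 0 to {num_nodes - 1}."
    ]
    for u, v in edges:
        lines.append(f"Node {u} is connected to node {v}.")
    return " ".join(lines)


def build_adjacency(num_nodes, edges):
    """Build an adjacency map {node: set(neighbors)} for an undirected graph."""
    adj = {n: set() for n in range(num_nodes)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def degree(adj, node):
    """Reference answer for the degree task."""
    return len(adj[node])


def is_connected(adj, source, target):
    """Reference answer for the connectivity task, using breadth-first search."""
    seen = {source}
    queue = deque([source])
    while queue:
        cur = queue.popleft()
        if cur == target:
            return True
        for nxt in adj[cur]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def all_shortest_paths(adj, source, target):
    """Reference answer for the all-shortest-path task (a multi-answer task)."""
    dist = {source: 0}
    preds = {source: []}  # predecessors of each node along shortest paths
    queue = deque([source])
    while queue:
        cur = queue.popleft()
        for nxt in sorted(adj[cur]):
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                preds[nxt] = [cur]
                queue.append(nxt)
            elif dist[nxt] == dist[cur] + 1:
                preds[nxt].append(cur)
    if target not in dist:
        return []

    def unwind(node):
        # Walk predecessor links back to the source, collecting every path.
        if node == source:
            return [[source]]
        return [path + [node] for p in preds[node] for path in unwind(p)]

    return sorted(unwind(target))


if __name__ == "__main__":
    # Toy graph: a 4-cycle (0-1-3-2-0) plus a separate edge (4-5).
    edges = [(0, 1), (0, 2), (1, 3), (2, 3), (4, 5)]
    adj = build_adjacency(6, edges)
    print(describe_graph(6, edges))
    print(is_connected(adj, 0, 4))
    print(degree(adj, 0))
    print(all_shortest_paths(adj, 0, 3))
```

An LLM's reply to a prompt built from `describe_graph` could then be compared against these reference answers. For the multi-answer task, a fidelity-style check would need the full set of shortest paths rather than any single one, which is why `all_shortest_paths` returns every path.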
