GraCoRe: Benchmarking Graph Comprehension and Complex Reasoning in Large Language Models

Zike Yuan; Ming Liu; Hui Wang; Bing Qin

TL;DR

GraCoRe introduces a three-tier hierarchical taxonomy to benchmark LLMs on graph understanding and reasoning across pure and heterogeneous graphs, addressing fragmentation in existing benchmarks. With 11 datasets and 5,140 graphs, it defines 19 tasks across 10 capabilities and evaluates 12 models using exact-match accuracy, standardizing scores via $z$-scores and $s$-scores to enable cross-task comparisons. Key findings show graph reasoning remains a major weakness, semantic enrichment aids reasoning, node order impacts results, and longer textual descriptions do not guarantee better performance. The benchmark is publicly open-sourced at https://github.com/ZIKEYUAN/GraCoRe and aims to guide future research in graph-aware LLM capabilities and benchmarking.
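The scoring pipeline mentioned above (exact-match accuracy followed by $z$-score standardization) can be illustrated with a minimal sketch. The function names, the whitespace trimming, the sample accuracies, and the use of the population standard deviation are all assumptions made for illustration; this summary does not give the paper's exact formulas, and the $s$-score is not reproduced here because its definition is not stated.

```python
from statistics import mean, pstdev


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after trimming whitespace.

    Trimming is an illustrative assumption; the benchmark may normalize differently.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def z_scores(values):
    """Standardize one task's scores across models: (x - mean) / std.

    Uses the population standard deviation (an assumption). If every model
    has the same score, all z-scores are defined as 0.0 to avoid dividing by zero.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical accuracies of three models on a single task.
task_accuracy = {"model_a": 0.2, "model_b": 0.5, "model_c": 0.8}
standardized = dict(zip(task_accuracy, z_scores(list(task_accuracy.values()))))
```

Standardizing within each task puts tasks of different difficulty on a common scale, which is what makes per-model scores comparable across tasks.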

Abstract

Evaluating the graph comprehension and reasoning abilities of Large Language Models (LLMs) is challenging and often incomplete. Existing benchmarks focus primarily on pure graph understanding and lack both comprehensive evaluation across all graph types and detailed capability definitions. This paper presents GraCoRe, a benchmark for systematically assessing LLMs' graph comprehension and reasoning. GraCoRe uses a three-tier hierarchical taxonomy to categorize and test models on pure graphs and heterogeneous graphs, subdividing capabilities into 10 distinct areas tested through 19 tasks. Our benchmark includes 11 datasets with 5,140 graphs of varying complexity. We evaluate four closed-source and eight open-source LLMs, conducting thorough analyses from both ability and task perspectives. Key findings reveal that the OpenAI o1 model has strong comprehension and reasoning capabilities, semantic enrichment enhances reasoning performance, node ordering impacts task success, and the ability to process longer texts does not necessarily improve graph comprehension or reasoning. GraCoRe is open-sourced at https://github.com/ZIKEYUAN/GraCoRe

Paper Structure (18 sections, 2 equations, 26 figures, 10 tables)

Introduction
Related Work
GraCoRe
Hierarchical Ability Taxonomy
Graph understanding
Graph reasoning
Data Collection
Data Statistics
Evaluation
Experiments
Experimental Setup
Main Results
Further Analysis
Conclusion
Comparison with existing benchmarks
...and 3 more sections

Figures (26)

Figure 1: GraCoRe encompasses two overarching abilities and 19 distinct tasks for LLMs in graph scenarios, facilitating granular benchmarking from basic perceptivity to advanced interactivity.
Figure 2: Our three-tier hierarchical ability taxonomy.
Figure 3: Performance of various LLMs for second and third layer ability dimension.
Figure 4: Effect of Graph Size.
Figure 5: Effect of Text Enhancement.
...and 21 more figures
