GraphArena: Evaluating and Exploring Large Language Models on Graph Computation

Jianheng Tang; Qifan Zhang; Yuhan Li; Nuo Chen; Jia Li

TL;DR

GraphArena is a benchmark for evaluating large language models on graph computation. It assembles 10,000 problems, built on real-world graphs, across four polynomial-time tasks and six NP-complete tasks. Its path-based evaluation separates correct, feasible-but-suboptimal, hallucinatory, and missing outputs, and the study compares ten LLMs against classical graph algorithms and Graph-LLM hybrids. LLMs excel at basic algorithmic tasks but struggle with NP-complete problems on larger graphs, and they hallucinate more often as problem complexity grows. Four mitigation strategies (chain-of-thought prompting, instruction tuning, code execution, and scaled test-time compute) offer varying benefits depending on task scale. The open-source benchmark highlights the limits of current LLMs in graph reasoning and points to practical ways of improving their reasoning on real-world graph tasks.

Abstract

The "arms race" of Large Language Models (LLMs) demands new benchmarks to examine their progress. In this paper, we introduce GraphArena, a benchmarking tool designed to evaluate LLMs on real-world graph computational problems. It offers a suite of four polynomial-time tasks (e.g., Shortest Distance) and six NP-complete challenges (e.g., Traveling Salesman Problem). GraphArena features a rigorous evaluation framework that classifies LLM outputs as correct, suboptimal (feasible but not optimal), hallucinatory (properly formatted but infeasible), or missing. Evaluation of over 10 LLMs reveals that even top-performing LLMs struggle with larger, more complex graph problems and exhibit hallucination issues. We further explore four potential solutions to address this issue and improve LLMs on graph computation, including chain-of-thought prompting, instruction tuning, code writing, and scaling test-time compute, each demonstrating unique strengths and limitations. GraphArena complements the existing LLM benchmarks and is open-sourced at https://github.com/squareRoot3/GraphArena.
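
The four-way classification described in the abstract can be illustrated with a minimal sketch for a shortest-path style task. This is not the GraphArena implementation: the names `Verdict`, `shortest_hops`, and `classify_path` are hypothetical, the graph is assumed to be unweighted and undirected, and the model's answer is assumed to have already been parsed into a list of node names (an empty list or `None` standing for an answer that could not be extracted).

```python
from collections import deque
from enum import Enum


class Verdict(Enum):
    CORRECT = "correct"              # feasible and optimal
    SUBOPTIMAL = "suboptimal"        # feasible but not optimal
    HALLUCINATORY = "hallucinatory"  # well-formed but infeasible
    MISSING = "missing"              # no answer could be extracted


def shortest_hops(edges, source, target):
    """Breadth-first search for the optimal hop count (None if unreachable)."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def classify_path(edges, source, target, path):
    """Classify a proposed path (hypothetical helper, illustration only)."""
    if not path:
        return Verdict.MISSING
    edge_set = {frozenset(e) for e in edges}
    # Feasible means: correct endpoints and every consecutive pair is a real edge.
    feasible = (
        path[0] == source
        and path[-1] == target
        and all(frozenset((u, v)) in edge_set for u, v in zip(path, path[1:]))
    )
    if not feasible:
        return Verdict.HALLUCINATORY
    optimal = shortest_hops(edges, source, target)
    return Verdict.CORRECT if len(path) - 1 == optimal else Verdict.SUBOPTIMAL


# Toy graph (made up for this example, not taken from the benchmark).
EDGES = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
for answer in (["A", "C", "D"], ["A", "B", "C", "D"], ["A", "D"], []):
    print(answer, classify_path(EDGES, "A", "D", answer).value)
```

Checking the full path rather than only the reported length is what lets a feasible but longer route be told apart from one that uses an edge the graph does not contain.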

Paper Structure (13 sections, 10 figures, 18 tables)

This paper contains 13 sections, 10 figures, 18 tables.

Introduction
Benchmark Construction
Dataset Collection
Task Selection
Evaluation Process
Experiments
Main Results
Exploring strategies to enhance LLMs on Graph Computation
Related Work
Conclusion
Additional Dataset Information
Additional Task Information
Additional Experimental Results

Figures (10)

Figure 1: Overview of the GraphArena benchmark.
Figure 2: Feasibility and accuracy comparison of five selected LLMs on each individual task. The circles represent performance levels, progressing outward from 20% to 100% in increments of 20%.
Figure 3: The percentage of problems where GPT-4o wins, ties, or loses against Random, Greedy, and Approximated algorithms.
Figure 4: The influence of graph size on hallucination probability for the Maximum Independent Set, Graph Diameter, and Connected Component tasks.
Figure 5: Performance comparison of GraphToken and its backbone LLM and GNN.
...and 5 more figures