Rethinking and Benchmarking Large Language Models for Graph Reasoning

Yuwei Hu, Xinyi Huang, Zhewei Wei, Yongchao Liu, Chuntao Hong

TL;DR

The work tackles a gap in evaluating and advancing LLMs for graph reasoning by revealing shortcomings in language-based and code-augmented approaches and by constructing a more challenging GraphAlgorithm benchmark. It introduces Simple-RTC, a reasoning-then-coding framework that decouples problem understanding and algorithm design from implementation, guiding LLMs to first design graph algorithms and then code solutions. Across extensive experiments, Simple-RTC achieves near-perfect performance on existing benchmarks and substantially outperforms prior methods on GraphAlgorithm, especially on unseen and NP-hard tasks, while exposing efficiency trade-offs and the value of code reuse. The findings emphasize the need for harder, more realistic benchmarks and suggest that focusing on algorithm design within LLMs can unlock significant gains in graph reasoning applications.

Abstract

Large Language Models (LLMs) for Graph Reasoning have been extensively studied over the past two years, with the goal of enabling LLMs to understand graph structures and reason on graphs to solve various graph problems, among which graph algorithm problems are the most prevalent. Recent studies underscore the potential of LLMs in handling graph reasoning tasks, but their performance is underwhelming. In this work, we point out issues with existing methods and benchmarks, and rethink the direction that LLMs for graph reasoning should strive toward. We find that base models, e.g., GPT-4o-mini, are largely underestimated due to improper reasoning focus. Base models with reasoning focus redirected from replicating graph algorithms to designing them can easily solve most graph reasoning tasks in existing benchmarks. To truly evaluate the graph reasoning capabilities of LLMs, we construct a more challenging GraphAlgorithm benchmark, comprising 239 different graph problems and 3,041 test instances collected from 4 competition platforms. Finally, we introduce a simple and strong baseline, Simple-Reasoning-Then-Coding (Simple-RTC), which guides LLMs to design graph algorithms first and then code to address graph reasoning tasks. Simple-RTC achieves near-perfect accuracy on existing benchmarks and significantly outperforms GPT-4o-mini and all prior methods on the GraphAlgorithm benchmark. This strong baseline encourages further advances in LLMs for Graph Reasoning.
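
The reasoning-then-coding idea in the abstract (design the graph algorithm first, then write code for it) can be illustrated with a minimal two-stage sketch. This is not the paper's pipeline or its prompts: the function names `reason_then_code` and `run_solution`, the prompt wording, and the stand-in `fake_llm` are all assumptions made for illustration. A real setup would pass a callable that queries an actual model.

```python
from typing import Callable, Dict


def reason_then_code(problem: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Hypothetical two-stage flow: ask for an algorithm design, then for code."""
    # Stage 1: the model designs an algorithm instead of tracing one by hand.
    design_prompt = (
        "Read the graph problem below and design an algorithm for it. "
        "Do not simulate the algorithm step by step.\n\n" + problem
    )
    design = llm(design_prompt)

    # Stage 2: the model turns its own design into an executable function.
    code_prompt = (
        "Implement the following algorithm design as a Python function "
        "named solve.\n\nProblem:\n" + problem + "\n\nDesign:\n" + design
    )
    code = llm(code_prompt)
    return {"design": design, "code": code}


def run_solution(code: str, *args):
    """Execute generated code and call its solve function.

    Running model-generated code with exec is unsafe outside a sandbox;
    it is used here only to keep the sketch short.
    """
    namespace: dict = {}
    exec(code, namespace)
    return namespace["solve"](*args)


def fake_llm(prompt: str) -> str:
    """Stand-in for a model call (assumption): returns canned text."""
    if prompt.startswith("Implement"):
        return (
            "from collections import deque\n"
            "def solve(n, edges, src, dst):\n"
            "    adj = {i: [] for i in range(n)}\n"
            "    for u, v in edges:\n"
            "        adj[u].append(v)\n"
            "        adj[v].append(u)\n"
            "    dist = {src: 0}\n"
            "    queue = deque([src])\n"
            "    while queue:\n"
            "        u = queue.popleft()\n"
            "        for v in adj[u]:\n"
            "            if v not in dist:\n"
            "                dist[v] = dist[u] + 1\n"
            "                queue.append(v)\n"
            "    return dist.get(dst, -1)\n"
        )
    return "Run breadth-first search from the source and report the hop count."


result = reason_then_code(
    "Find the shortest path length between two nodes of an undirected graph.",
    fake_llm,
)
path_length = run_solution(result["code"], 4, [(0, 1), (1, 2), (2, 3)], 0, 3)
```

The graph instance is handed to the generated function as data, so the model never has to replicate the traversal in natural language. That separation is the point the abstract makes about redirecting the reasoning focus.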

Paper Structure

This paper contains 56 sections, 2 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: Illustration of Language-Based, Code-Augmented methods and our Simple-RTC.
  • Figure 2: Performance of different methods on benchmarks (left) and on different tasks within GraphArena, Talk like a Graph, GraphWiz and NLGraph (right). The red bars show the baseline model, GPT-4o-mini (reasoning to replicate graph algorithms), while the golden bars show our proposed Simple-RTC (reasoning to design graph algorithms), which also uses GPT-4o-mini as the base model. The shift in reasoning focus leads to a comprehensive and significant improvement in performance.
  • Figure 3: The Pipeline of Simple-RTC.
  • Figure 4: Performance comparison between base model (GPT-4o-mini) and Simple-RTC on GraphArena.
  • Figure 5: Performance of Simple-RTC with different base models on GraphAlgorithm.