Revisiting the Graph Reasoning Ability of Large Language Models: Case Studies in Translation, Connectivity and Shortest Path

Xinnan Dai, Qihao Wen, Yifei Shen, Hongzhi Wen, Dongsheng Li, Jiliang Tang, Caihua Shan

TL;DR

This paper re-evaluates the graph reasoning capabilities of large language models on three fundamental tasks (graph description translation, graph connectivity, and the shortest-path problem) using balanced synthetic datasets and real-world knowledge graphs. It systematically analyzes how graph description methods, connectivity types, and prompt strategies affect performance, and finds persistent failures when graphs are described purely in text, particularly as the graphs grow longer or more complex. The authors show that Node List descriptions, meaningful node names, and algorithm-guided prompts (e.g., BFS-CoT, Dijkstra-CoT) can improve reasoning, and that model scale and training data substantially influence outcomes, with GPT-4 typically outperforming GPT-3 and Llama variants. The findings offer guidelines for dataset design, prompt construction, and model tuning, while highlighting intrinsic limitations of text-only graph understanding.

Abstract

Large Language Models (LLMs) have achieved great success in various reasoning tasks. In this work, we focus on the graph reasoning ability of LLMs. Although theoretical studies have proved that LLMs are capable of handling graph reasoning tasks, empirical evaluations reveal numerous failures. To deepen our understanding of this discrepancy, we revisit the ability of LLMs on three fundamental graph tasks: graph description translation, graph connectivity, and the shortest-path problem. Our findings suggest that LLMs can fail to understand graph structures through text descriptions and exhibit varying performance across all three tasks. We also perform a real-world investigation on knowledge graphs and make observations consistent with our findings. The code and datasets are available.
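
To make the "graph description translation" task more concrete, the sketch below renders one small directed graph in three textual forms: an adjacency matrix, an edge list, and per-node neighbor sets. The function names and the exact output wording are hypothetical and chosen only for illustration; the prompt templates actually used in the paper are not reproduced here.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # (source, target) of a directed edge


def adjacency_matrix_text(n: int, edges: List[Edge]) -> str:
    """Render the graph as rows of 0/1 entries (row u, column v is 1 if u -> v)."""
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        matrix[u][v] = 1
    return "\n".join(" ".join(str(x) for x in row) for row in matrix)


def edge_list_text(edges: List[Edge]) -> str:
    """Render the graph as a comma-separated list of (source, target) pairs."""
    return ", ".join(f"({u}, {v})" for u, v in edges)


def node_list_text(n: int, edges: List[Edge]) -> str:
    """Render the graph as one line per node listing its out-neighbors."""
    neighbors: Dict[int, List[int]] = {u: [] for u in range(n)}
    for u, v in edges:
        neighbors[u].append(v)
    return "\n".join(f"{u}: {sorted(vs)}" for u, vs in neighbors.items())


# Hypothetical 3-node example graph (not taken from the paper).
example_edges = [(0, 1), (0, 2), (1, 2)]
print(adjacency_matrix_text(3, example_edges))
print(edge_list_text(example_edges))
print(node_list_text(3, example_edges))
```

Translating between descriptions then amounts to parsing one of these strings and emitting another, which gives a simple ground truth against which a model's output could be checked.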

Paper Structure (45 sections, 2 equations, 11 figures, 13 tables)

This paper contains 45 sections, 2 equations, 11 figures, 13 tables.

Figures (11)

  • Figure 1: Overview of the datasets, showing accuracy and distribution across different connectivity types. We evaluate GPT-3 on determining whether a path exists between two nodes. Previous work [wu2024can] primarily focused on 1-hop and 2-hop connections, resulting in high accuracy, but overlooked that accuracy tends to drop for 3-, 4-, and 5-hop connections.
  • Figure 2: Three types of graph descriptions. A graph can be described by an adjacency matrix, an edge list, or neighborhood node sets.
  • Figure 3: Different types of connectivity. The directed graph consists of 8 nodes, where solid lines represent directed edges and dotted lines indicate that no edge exists. The four connectivity types are: (A) K-hop: nodes 5 and 6 connect to node 4 within 1 hop and 2 hops, respectively; (B) Singleton: node 3 is an isolated node and is not attached to node 4; (C) Isolated Components: nodes 2 and 4 belong to separate components with no path between them; (D) Asymmetric: node 6 has a directed edge towards node 7, but the two nodes are not connected in the opposite direction.
  • Figure 4: 3-hop results
  • Figure 5: 5-hop results
  • ...and 6 more figures
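
The connectivity types named in the Figure 3 caption can be told apart with breadth-first search. The following is a minimal sketch under one assumed reading of those categories: a pair is K-hop if a directed path exists, Singleton if either node has no incident edges, Isolated Components if the nodes are not connected even when edge directions are ignored, and Asymmetric otherwise. The function names and the example graph are hypothetical and are not the paper's code or data.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Set, Tuple

Edge = Tuple[int, int]  # (source, target) of a directed edge


def hop_distance(adj: Dict[int, Set[int]], source: int, target: int) -> Optional[int]:
    """Breadth-first search; returns the fewest hops from source to target, or None."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return dist[u]
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None


def connectivity_type(nodes: Iterable[int], edges: List[Edge],
                      source: int, target: int) -> str:
    """Label a (source, target) pair under the assumed category definitions."""
    directed: Dict[int, Set[int]] = {u: set() for u in nodes}
    undirected: Dict[int, Set[int]] = {u: set() for u in directed}
    for u, v in edges:
        directed[u].add(v)
        undirected[u].add(v)
        undirected[v].add(u)

    k = hop_distance(directed, source, target)
    if k is not None:
        return f"{k}-hop"                 # a directed path of k edges exists
    if not undirected[source] or not undirected[target]:
        return "singleton"                # one endpoint has no edges at all
    if hop_distance(undirected, source, target) is None:
        return "isolated components"      # different weakly connected components
    return "asymmetric"                   # linked only against the edge direction


# Hypothetical example graph: node 5 has no edges.
example_nodes = range(6)
example_edges = [(0, 1), (1, 2), (3, 4)]
print(connectivity_type(example_nodes, example_edges, 0, 2))
print(connectivity_type(example_nodes, example_edges, 2, 0))
```

Labeling every node pair this way is one possible route to a dataset balanced across connectivity types, which is the property the Figure 1 caption says earlier benchmarks lacked.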