GraphOmni: A Comprehensive and Extendable Benchmark Framework for Large Language Models on Graph-theoretic Tasks

Hao Xu, Xiangru Jian, Xinjian Zhao, Wei Pang, Chao Zhang, Suyuchen Wang, Qixin Zhang, Zhengyuan Dong, Joao Monteiro, Bang Liu, Qiuzhuang Sun, Tianshu Yu

TL;DR

GraphOmni introduces a broad, extensible benchmark to evaluate how large language models reason about graph-structured problems when graphs are described in natural language. By jointly varying graph types, serialization formats, and prompting schemes, it reveals substantial interactions that shape model performance across canonical graph tasks. The study finds that state-of-the-art models show only moderate accuracy with notable variability across configurations, and it demonstrates an RL-based approach (RL-Opt/RL-Scale) that adaptively selects serialization strategies to reduce evaluation cost while preserving high accuracy. Open-source and closed-source models exhibit different sensitivities to representation choices, underscoring the need for task- and model-specific configurations. GraphOmni thus provides a scalable foundation for advancing LLM-based graph reasoning and offers practical methods to optimize prompt and representation strategies at limited cost.

Abstract

This paper introduces GraphOmni, a comprehensive benchmark designed to evaluate the reasoning capabilities of LLMs on graph-theoretic tasks articulated in natural language. GraphOmni encompasses diverse graph types, serialization formats, and prompting schemes, significantly exceeding prior efforts in both scope and depth. Through extensive systematic evaluation, we identify critical interactions among these dimensions, demonstrating their substantial impact on model performance. Our experiments reveal that state-of-the-art models like Claude-3.5 and o4-mini consistently outperform other models, yet even these leading models exhibit substantial room for improvement. Performance variability is evident depending on the specific combinations of factors we considered, underscoring the necessity of comprehensive evaluations across these interconnected dimensions. Additionally, we observe distinct impacts of serialization and prompting strategies between open-source and closed-source models, encouraging the development of tailored approaches. Motivated by the findings, we also propose a reinforcement learning-inspired framework that adaptively selects the optimal factors influencing LLM reasoning capabilities. This flexible and extendable benchmark not only deepens our understanding of LLM performance on structured tasks but also provides a robust foundation for advancing research in LLM-based graph reasoning. The code and datasets are available at https://github.com/GAI-Community/GraphOmni.
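The abstract describes the adaptive selection framework only at a high level, so the following is a minimal illustrative sketch rather than the authors' algorithm. It swaps in a plain epsilon-greedy bandit, a standard reinforcement-learning baseline, to show how one might pick a (serialization format, prompt scheme) pair without exhaustively evaluating every pair at full cost. The function name `epsilon_greedy_select`, the `reward_fn` callback, and the configuration names are all hypothetical.

```python
import random

def epsilon_greedy_select(configs, reward_fn, rounds=200, epsilon=0.1, seed=0):
    """Pick a configuration by trial and error (generic epsilon-greedy bandit).

    configs   : list of hashable configurations, e.g. (serialization, prompt) pairs
    reward_fn : hypothetical callback (config, rng) -> reward in [0, 1],
                e.g. 1.0 if the LLM answered one sampled question correctly
    """
    rng = random.Random(seed)
    counts = {c: 0 for c in configs}
    means = {c: 0.0 for c in configs}

    for t in range(rounds):
        if t < len(configs):
            choice = configs[t]            # try every configuration once
        elif rng.random() < epsilon:
            choice = rng.choice(configs)   # explore a random configuration
        else:
            choice = max(configs, key=lambda c: means[c])  # exploit the best so far

        reward = reward_fn(choice, rng)
        counts[choice] += 1
        # Incremental update of the running mean reward for this configuration.
        means[choice] += (reward - means[choice]) / counts[choice]

    best = max(configs, key=lambda c: means[c])
    return best, means, counts
```

In practice `reward_fn` would query a model on a sampled task instance; most of the evaluation budget then goes to configurations that already look promising, which is the cost-saving idea the abstract alludes to.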

Paper Structure

This paper contains 57 sections, 3 equations, 28 figures, 20 tables, 1 algorithm.

Figures (28)

  • Figure 1: Overview of GraphOmni. The benchmark evaluates LLMs across diverse graph types, serialization formats, and prompt schemes on a comprehensive suite of graph-theoretic tasks.
  • Figure 2: Radar charts comparing the performance of open-source (top row) and closed-source (bottom row) LLMs across six canonical graph reasoning tasks: BFS order, Connectivity, Cycle detection, Diameter calculation, Shortest path, and Triangle counting. Results are presented at three difficulty levels: easy, medium, and hard.
  • Figure 3: GraphOmni Evaluation Pipeline. We convert graph-theoretic tasks into text-based questions about local properties such as connectivity and cycle detection, and about global properties such as triangle counting and diameter computation. In the adjustable settings, we vary three dimensions (graph type, serialization format, and prompt scheme) and generate every possible combination. Each combination is fed to the LLM, and its output is compared against the ground truth to assess reasoning performance across all tasks.
  • Figure 4: Token usage for prompt-serialization combinations. More detailed statistics are included in Figures \ref{fig:prompt_tokens_6} and \ref{fig:sel_tokens_6}.
  • Figure 5: Performance heatmaps for different prompt schemes and serialization formats on Diameter calculation of GPT-4o. The color intensity represents the accuracy, with darker colors indicating better performance. The solid and dashed boxes highlight the best and second-best performing combinations, respectively.
  • ...and 23 more figures
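
The evaluation pipeline described in the Figure 3 caption (vary graph type, serialization format, and prompt scheme; generate every combination; compare the model output with the ground truth) can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the benchmark's actual code: the option names, the helper functions, and the `ask_model` callback are all hypothetical, and connectivity is used as the single example task.

```python
from collections import deque
from itertools import product

# Hypothetical option names for the three adjustable dimensions.
GRAPH_TYPES = ["path", "cycle"]
SERIALIZATIONS = ["edge_list", "adjacency_list"]
PROMPT_SCHEMES = ["zero_shot", "chain_of_thought"]

def make_graph(kind, n=5):
    """Build a toy undirected graph on nodes 0..n-1."""
    edges = [(i, i + 1) for i in range(n - 1)]
    if kind == "cycle":
        edges.append((n - 1, 0))
    return n, edges

def adjacency(n, edges):
    adj = {i: [] for i in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj

def serialize(n, edges, fmt):
    """Turn the graph into text in the chosen serialization format."""
    if fmt == "edge_list":
        return "Edges: " + ", ".join(f"({u}, {v})" for u, v in edges)
    adj = adjacency(n, edges)
    return "\n".join(f"{u}: {sorted(vs)}" for u, vs in adj.items())

def build_prompt(graph_text, scheme, question):
    suffix = " Let's think step by step." if scheme == "chain_of_thought" else ""
    return f"{graph_text}\n{question}{suffix}"

def is_connected(n, edges):
    """Ground truth for the connectivity task, computed with BFS from node 0."""
    adj = adjacency(n, edges)
    seen, queue = {0}, deque([0])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == n

def evaluate(ask_model):
    """Run every combination and record whether the answer matches the ground truth.

    ask_model is a hypothetical callback: prompt string -> answer string.
    """
    results = {}
    question = "Is this graph connected? Answer yes or no."
    for kind, fmt, scheme in product(GRAPH_TYPES, SERIALIZATIONS, PROMPT_SCHEMES):
        n, edges = make_graph(kind)
        prompt = build_prompt(serialize(n, edges, fmt), scheme, question)
        truth = "yes" if is_connected(n, edges) else "no"
        answer = ask_model(prompt).strip().lower()
        results[(kind, fmt, scheme)] = (answer == truth)
    return results
```

Calling `evaluate` with any function that maps a prompt to an answer string yields one correctness flag per combination, which is the raw material for accuracy heatmaps like the one described for Figure 5.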