Can LLMs Reason Structurally? An Evaluation via the Lens of Data Structures

Yu He, Yingxi Li, Colin White, Ellen Vitercik

TL;DR

DSR-Bench evaluates whether LLMs can reason structurally by manipulating canonical data structures of diverse relational types. It combines a main data-structure suite, a Challenge set, and three specialized suites (Spatial, Natural, Code) to diagnose core weaknesses in constructing, maintaining, and reasoning about data relationships; outputs are automatically verifiable via a JSON schema. Across ten state-of-the-art models, results show persistent gaps: instruction-tuned models struggle with multi-attribute and multi-hop reasoning, while reasoning models remain below practical thresholds on complex structures and fail to honor user-imposed constraints. The findings highlight the need for algorithm-centric architectures and memory-enabled reasoning to bridge the gap between current LLM capabilities and real-world structural reasoning demands. DSR-Bench serves as a principled, extensible diagnostic tool for exposing bottlenecks and guiding targeted improvements toward more reliable, general-purpose reasoning systems.

Abstract

As large language models (LLMs) take on increasingly complex tasks, understanding their algorithmic reasoning abilities has become essential. However, existing evaluations focus on distinct and isolated tasks. We propose a unified diagnostic lens: structural reasoning, that is, understanding and manipulating relationships such as order, hierarchy, and connectivity. We introduce DSR-Bench, the first benchmark to systematically evaluate LLM structural reasoning through canonical data structures, which serve as interpretable, algorithmically meaningful abstractions. DSR-Bench spans 20 data structures, 35 operations, and 4,140 synthetically generated problem instances with minimal contamination. The benchmark's hierarchical design pinpoints specific failure modes, while its fully automated evaluation ensures objective and consistent assessment. Benchmarking ten state-of-the-art LLMs reveals critical limitations: the top-performing model scores only 0.498 out of 1 on challenging instances. Three additional evaluation suites reveal further weaknesses: models perform poorly on spatial data and natural language scenarios, and fail to reason over their own generated code. DSR-Bench offers a principled diagnostic tool for structural reasoning, helping expose reasoning bottlenecks and guide the development of more capable and reliable LLMs.
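
To make the idea of synthetically generated, automatically verifiable instances concrete, here is a minimal sketch in Python. It is not the authors' harness: the queue task, the function names `make_queue_instance` and `score_response`, and the `final_state` answer key are all hypothetical choices made for illustration. The sketch generates a random operation sequence from a seed, computes the ground truth by simulation, and scores a JSON response by exact match.

```python
import json
import random
from collections import deque


def make_queue_instance(seed: int, n_ops: int = 8):
    """Generate a random enqueue/dequeue sequence and its ground-truth final state.

    Hypothetical generator for illustration; not the benchmark's own code.
    """
    rng = random.Random(seed)  # seeded so the instance is reproducible
    queue = deque()
    ops = []
    for _ in range(n_ops):
        # Only dequeue when the queue is non-empty, so every operation is valid.
        if queue and rng.random() < 0.4:
            queue.popleft()
            ops.append(("dequeue", None))
        else:
            value = rng.randint(0, 99)
            queue.append(value)
            ops.append(("enqueue", value))
    return ops, list(queue)


def score_response(response_text: str, expected: list) -> int:
    """Return 1 if the response is a JSON object whose 'final_state' matches exactly, else 0.

    The 'final_state' key is an assumed answer format, not the paper's schema.
    """
    try:
        parsed = json.loads(response_text)
    except json.JSONDecodeError:
        return 0
    if not isinstance(parsed, dict):
        return 0
    return int(parsed.get("final_state") == expected)


if __name__ == "__main__":
    ops, expected = make_queue_instance(seed=0)
    correct = json.dumps({"final_state": expected})
    print(score_response(correct, expected))     # 1
    print(score_response("not json", expected))  # 0
```

Because the ground truth comes from simulating the operations rather than from a stored answer key, fresh instances can be produced from new seeds, which is one plausible way to keep contamination low. Exact-match scoring on a structured answer removes the need for a human or model judge.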

Paper Structure

This paper contains 95 sections, 4 figures, and 36 tables.

Figures (4)

  • Figure 1: Overview of DSR-Bench's main suite, with six data structure categories capturing distinct relationships, plus the challenge subset. Three specialized suites holistically evaluate structural reasoning under different settings: spatial (multi-dimensional data), natural (realistic natural language scenarios), and code (code generation).
  • Figure 2: Left: Scores of ten models on DSR-Bench-main, averaged across three runs. Right: Radar chart showing scores of top-performing models across six data structure categories. We note DSR-Bench also includes a challenge suite, where the best model scores only 0.498.
  • Figure 3: Example K-D Tree instances from three non-uniform distributions.
  • Figure 4: The pipeline for generating natural language prompts.