Can LLMs Reason Structurally? An Evaluation via the Lens of Data Structures
Yu He, Yingxi Li, Colin White, Ellen Vitercik
TL;DR
DSR-Bench evaluates whether LLMs can reason structurally by manipulating canonical data structures of diverse relational types. It combines a main data-structure suite, a Challenge set, and three specialized suites (Spatial, Natural, Code) to diagnose core weaknesses in constructing, maintaining, and reasoning about data relationships; outputs are automatically verifiable via a JSON schema. Across ten state-of-the-art models, results show persistent gaps: instruction-tuned models struggle with multi-attribute and multi-hop reasoning, while reasoning models remain below practical thresholds on complex structures and fail to honor user-imposed constraints. The findings highlight the need for algorithm-centric architectures and memory-enabled reasoning to bridge the gap between current LLM capabilities and real-world structural reasoning demands. DSR-Bench serves as a principled, extensible diagnostic tool for exposing bottlenecks and guiding targeted improvements toward more reliable, general-purpose reasoning systems.
Abstract
As large language models (LLMs) take on increasingly complex tasks, understanding their algorithmic reasoning abilities has become essential. However, existing evaluations focus on distinct and isolated tasks. We propose a unified diagnostic lens: structural reasoning, that is, understanding and manipulating relationships such as order, hierarchy, and connectivity. We introduce DSR-Bench, the first benchmark to systematically evaluate LLM structural reasoning through canonical data structures, which serve as interpretable, algorithmically meaningful abstractions. DSR-Bench spans 20 data structures, 35 operations, and 4,140 synthetically generated problem instances with minimal contamination. The benchmark's hierarchical design pinpoints specific failure modes, while its fully automated evaluation ensures objective and consistent assessment. Benchmarking ten state-of-the-art LLMs reveals critical limitations: the top-performing model scores only 0.498 out of 1 on challenging instances. Three additional evaluation suites reveal further weaknesses: models perform poorly on spatial data and natural language scenarios, and fail to reason over their own generated code. DSR-Bench offers a principled diagnostic tool for structural reasoning, helping expose reasoning bottlenecks and guide the development of more capable and reliable LLMs.
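To make the idea of synthetically generated instances with automated evaluation concrete, the following is a minimal sketch of what one such problem might look like for a FIFO queue. It is an illustration under stated assumptions, not the benchmark's actual harness: the function names `make_queue_instance` and `score_response`, the `final_state` answer key, and the 0/1 exact-match scoring are all hypothetical choices made for this example.

```python
import json
import random
from collections import deque


def make_queue_instance(num_ops, seed):
    """Generate a random enqueue/dequeue sequence and its ground-truth final state.

    Hypothetical generator for illustration; not DSR-Bench's own code.
    """
    rng = random.Random(seed)  # seeded so the instance is reproducible
    queue = deque()
    operations = []
    for _ in range(num_ops):
        # Only dequeue when the queue is non-empty, so every operation is valid.
        if queue and rng.random() < 0.4:
            queue.popleft()
            operations.append(["dequeue"])
        else:
            value = rng.randint(0, 99)
            queue.append(value)
            operations.append(["enqueue", value])
    return {"operations": operations, "expected": list(queue)}


def score_response(raw_response, expected):
    """Return 1 if the model's JSON answer matches the ground truth, else 0.

    Assumes the model was asked to reply with {"final_state": [...]};
    that key name is an assumption of this sketch.
    """
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError:
        return 0  # malformed output counts as a failure
    if not isinstance(parsed, dict):
        return 0
    answer = parsed.get("final_state")
    if not isinstance(answer, list):
        return 0
    return int(answer == expected)


if __name__ == "__main__":
    instance = make_queue_instance(num_ops=8, seed=0)
    print(instance["operations"])
    # A correct response serializes the simulated final state.
    correct = json.dumps({"final_state": instance["expected"]})
    print(score_response(correct, instance["expected"]))
```

Because the ground truth comes from simulating the operations directly, scoring needs no human judgment, and a fresh seed yields a fresh instance, which is one plausible way to limit contamination from memorized examples.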
