Cognitive Foundations for Reasoning and Their Manifestation in LLMs

Priyanka Kargupta, Shuyue Stella Li, Haocheng Wang, Jinu Lee, Shan Chen, Orevaoghene Ahia, Dean Light, Thomas L. Griffiths, Max Kleiman-Weiner, Jiawei Han, Asli Celikyilmaz, Yulia Tsvetkov

TL;DR

LLMs achieve high performance yet generalize poorly on ill-defined tasks, suggesting reliance on non-human reasoning processes. The authors synthesize cognitive science into a four-dimensional taxonomy with 28 cognitive elements (Reasoning Invariants, Meta-Cognitive Controls, Reasoning Representations, Reasoning Operations) and conduct a large-scale, multimodal empirical study (192K traces plus 54 human traces) to map cognition in models and humans. They also develop test-time cognitive-structure guidance that scaffolds successful reasoning patterns, yielding up to 66.7% improvement on complex problems. The work bridges cognitive science and LLM evaluation, enabling systematic diagnosis of reasoning failures and principled development of models that deploy robust cognitive mechanisms rather than spurious shortcuts, with implications for training, evaluation, and theory-driven interventions.

Abstract

Large language models (LLMs) solve complex problems yet fail on simpler variants, suggesting they achieve correct outputs through mechanisms fundamentally different from human reasoning. To understand this gap, we synthesize cognitive science research into a taxonomy of 28 cognitive elements spanning reasoning invariants, meta-cognitive controls, representations for organizing reasoning & knowledge, and transformation operations. We introduce a fine-grained evaluation framework and conduct the first large-scale empirical analysis of 192K traces from 18 models across text, vision, and audio, complemented by 54 human think-aloud traces, which we make publicly available. We find that models under-utilize cognitive elements correlated with success, narrowing to rigid sequential processing on ill-structured problems where diverse representations and meta-cognitive monitoring are critical. Human traces show more abstraction and conceptual processing, while models default to surface-level enumeration. Meta-analysis of 1.6K LLM reasoning papers reveals that the research community concentrates on easily quantifiable elements (sequential organization: 55%, decomposition: 60%) but neglects meta-cognitive controls (self-awareness: 16%) that correlate with success. Models possess behavioral repertoires associated with success but fail to deploy them spontaneously. Leveraging these patterns, we develop test-time reasoning guidance that automatically scaffolds successful structures, improving performance by up to 66.7% on complex problems. By establishing a shared vocabulary between cognitive science and LLM research, our framework enables systematic diagnosis of reasoning failures and principled development of models that reason through robust cognitive mechanisms rather than spurious shortcuts, while providing tools to test theories of human cognition at scale.

Paper Structure

This paper contains 55 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 1: An example of cognitive elements present in a reasoning trace for building a LEGO spaceship. We characterize the different elements along the four dimensions of our taxonomy, as shown in Table \ref{tab:taxonomy}.
  • Figure 2: Distribution of cognitive element presence across 1,598 arXiv LLM Reasoning papers. Partially present indicates that there is evidence that the element was considered in the design (motivation, method, evaluation) of the paper but was not the primary focus. Present indicates that there is evidence that the element was a conscious and significant design decision. Details provided in Section \ref{sec:research_design}.
  • Figure 3: Dataset composition and model performance across problem types. (a) Problem distribution across modalities, organized by Jonassen's problem structuredness continuum, shows coverage decreases for less structured problems. (b) Accuracy decreases as problems become less structured, with models showing consistent performance on story tasks (78.8%) but high variance on dilemmas (3.3-99.1%).
  • Figure 4: (Left) Presence rate of each cognitive element for each problem type (ranging from well-structured to ill-structured). (Right) Positive Pointwise Mutual Information (PPMI) between the problem type and cognitive element (correlation between their behavioral occurrence and reasoning trace success).
  • Figure 5: (Left) Presence rate of each cognitive element for each model (how often the element occurs across all of a model's reasoning traces) across all modalities. Average rates per model: DeepSeek-R1: 0.458, Olmo-3-7B-Think: 0.491, Olmo-3-32B-Think: 0.484, Qwen3-8B: 0.357, Qwen3-14B: 0.35, Qwen3-32B: 0.384, DeepSeek-R1-Distill-Qwen-1.5B: 0.316, DeepSeek-R1-Distill-Qwen-7B: 0.334, DeepSeek-R1-Distill-Qwen-14B: 0.349, DeepSeek-R1-Distill-Qwen-32B: 0.36, DeepSeek-R1-Distill-Llama-8B: 0.315, DeepSeek-R1-Distill-Llama-70B: 0.346, DeepScaleR-1.5B-Preview: 0.317, DeepHermes-3-Llama-3-8B-Preview: 0.122, OpenThinker-32B: 0.505, s1.1-32B: 0.597, Qwen3-Omni-30B (audio): 0.253, and Zebra-CoT (image): 0.348. (Right) Average reasoning trace length (# characters) for each model per problem type.
  • ...and 2 more figures
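
The Figure 4 caption refers to Positive Pointwise Mutual Information (PPMI) between problem types and cognitive elements. As a rough illustration of the standard definition, PPMI(x, y) = max(0, log2[p(x, y) / (p(x) p(y))]), the sketch below computes it from a list of (problem type, element) co-occurrence observations. The function name `ppmi_table`, the input format, and the toy data are assumptions made for illustration; how the paper estimates these probabilities (for example, whether only successful traces are counted) is not specified here.

```python
import math
from collections import Counter


def ppmi_table(pairs):
    """Compute PPMI for every observed (problem_type, element) pair.

    `pairs` is a hypothetical input: one tuple per observation of a
    cognitive element in a reasoning trace of a given problem type.
    """
    pairs = list(pairs)
    total = len(pairs)
    joint = Counter(pairs)                       # counts of (type, element)
    type_counts = Counter(t for t, _ in pairs)   # marginal counts per type
    elem_counts = Counter(e for _, e in pairs)   # marginal counts per element

    table = {}
    for (t, e), n in joint.items():
        p_joint = n / total
        p_type = type_counts[t] / total
        p_elem = elem_counts[e] / total
        pmi = math.log2(p_joint / (p_type * p_elem))
        table[(t, e)] = max(0.0, pmi)            # clip negative PMI to zero
    return table


# Toy observations (invented for illustration, not data from the paper).
toy_pairs = (
    [("dilemma", "self-awareness")] * 3
    + [("dilemma", "decomposition")] * 1
    + [("algorithmic", "decomposition")] * 3
    + [("algorithmic", "self-awareness")] * 1
)
toy_ppmi = ppmi_table(toy_pairs)
```

In this toy data, an element that appears with a problem type more often than chance would predict gets a positive score, while one that appears less often than chance is clipped to zero, which is what distinguishes PPMI from plain PMI.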