Large Language Model Reasoning Failures

Peiyang Song, Pengrui Han, Noah Goodman

TL;DR

This work provides the first comprehensive taxonomy and synthesis of reasoning failures in Large Language Models (LLMs), sorting failures into fundamental, domain-specific, and robustness categories across informal, formal, and embodied reasoning. It surveys cognitive biases, Theory of Mind (ToM) and social-norm reasoning, logic in natural language, math and programming benchmarks, and 1D/2D/3D embodied reasoning, linking failures to root causes in architecture and training. The authors discuss mitigation strategies (data-based, algorithmic, and prompting-based methods, tool integration, and external simulators) and advocate for unified, dynamic robustness benchmarks, failure-injection benchmarks, and deeper grounding via world models. The study aims to guide future research toward stronger, more reliable, and verifiable LLM reasoning, with a public GitHub collection to support ongoing work and replication.

Abstract

Large Language Models (LLMs) have exhibited remarkable reasoning capabilities, achieving impressive results across a wide range of tasks. Despite these advances, significant reasoning failures persist, occurring even in seemingly simple scenarios. To systematically understand and address these shortcomings, we present the first comprehensive survey dedicated to reasoning failures in LLMs. We introduce a novel categorization framework that distinguishes reasoning into embodied and non-embodied types, with the latter further subdivided into informal (intuitive) and formal (logical) reasoning. In parallel, we classify reasoning failures along a complementary axis into three types: fundamental failures intrinsic to LLM architectures that broadly affect downstream tasks; application-specific limitations that manifest in particular domains; and robustness issues characterized by inconsistent performance across minor variations. For each reasoning failure, we provide a clear definition, analyze existing studies, explore root causes, and present mitigation strategies. By unifying fragmented research efforts, our survey provides a structured perspective on systemic weaknesses in LLM reasoning, offering valuable insights and guiding future research towards building stronger, more reliable, and robust reasoning capabilities. We additionally release a comprehensive collection of research works on LLM reasoning failures, as a GitHub repository at https://github.com/Peiyang-Song/Awesome-LLM-Reasoning-Failures, to provide an easy entry point to this area.

Paper Structure (57 sections, 5 figures, 15 tables)

Figures (5)

  • Figure 1: A Taxonomy of LLM Reasoning Failures. We adopt a nuanced 2-axis structure (reasoning type $\times$ failure type), with each row representing a reasoning category and each column a failure category. A more detailed explanation is presented in Section \ref{sec:foundation}.
  • Figure 2: Reasoning Taxonomy & Main Survey Structure.
  • Figure 3: Taxonomy of Informal LLM Reasoning Failures.
  • Figure 4: Taxonomy of Formal LLM Reasoning Failures.
  • Figure 5: Taxonomy of Embodied LLM Reasoning Failures.