Large Language Models and Mathematical Reasoning Failures
Johan Boye, Birger Moell
TL;DR
This study probes the mathematical reasoning of large language models beyond final answers by posing a curated set of fifty word problems and manually inspecting the solution processes. It finds a wide performance gap across models: some attain high accuracy yet sometimes rely on flawed reasoning, while others fail to produce correct solutions entirely. The findings underscore the importance of auditing reasoning traces, reveal persistent gaps in spatial, strategic, and multi-step deduction, and highlight the need for improved structured reasoning and constraint handling in LLMs. Overall, while modern models exhibit substantial mathematical knowledge, their problem-solving proficiency remains imperfect and context-dependent, calling for targeted methodological improvements and evaluation standards that go beyond answer correctness alone.
Abstract
This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems. Unlike prior studies that focus solely on answer correctness, we rigorously analyze both final answers and solution steps to identify reasoning failures. Evaluating eight state-of-the-art models, including Mixtral, Llama, Gemini, GPT-4o, and OpenAI's o1 variants, we find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic, sometimes producing correct answers through flawed logic. Common failure modes include unwarranted assumptions, over-reliance on numerical patterns, and difficulty translating physical intuition into mathematical steps. Manual analysis reveals that models struggle with problems requiring multi-step deduction or real-world knowledge, despite possessing broad mathematical knowledge. Our results underscore the importance of evaluating reasoning processes, not just answers, and caution against overestimating LLMs' problem-solving proficiency. The study highlights persistent gaps in LLMs' generalization abilities, emphasizing the need for targeted improvements in structured reasoning and constraint handling.
