Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models
Gaurav Srivastava, Aafiya Hussain, Sriram Srinivasan, Xuan Wang
TL;DR
Strong performance on complex benchmarks does not guarantee that large language models reason reliably about basic math, especially under constrained reasoning budgets. This work introduces the Overthinking Score, a harmonic-mean metric of accuracy and token efficiency, together with a dynamic test-generation protocol spanning 14 basic math tasks, and uses both to study the accuracy-verbosity tradeoff across 53 LLMs. Key findings include non-monotonic scaling, substantial token waste from extended reasoning, and sharp accuracy drops under token constraints, all of which challenge the assumption that more reasoning improves math performance. The results advocate adaptive stopping, step-level verification, and efficiency-aware deployment as routes to reliable mathematical reasoning in LLMs.
Abstract
Large language models (LLMs) achieve impressive performance on complex mathematical benchmarks yet sometimes fail on basic math reasoning while generating unnecessarily verbose responses. In this paper, we present a systematic benchmark and comprehensive empirical study to evaluate the efficiency of reasoning in LLMs, focusing on the fundamental tradeoff between accuracy and overthinking. First, we formalize the accuracy-verbosity tradeoff. Second, we introduce the Overthinking Score, a harmonic-mean metric combining accuracy and token efficiency for holistic model evaluation. Third, we establish an evaluation protocol with dynamically generated data across 14 basic math tasks. Fourth, we conduct a large-scale empirical study evaluating 53 LLMs, including reasoning and quantized variants, across different reasoning budgets. Our findings reveal: 1) model performance on complex benchmarks does not translate directly to basic math reasoning; 2) reasoning models generate ~18× more tokens while sometimes achieving lower accuracy, and they exhibit catastrophic collapse when tokens are constrained, with accuracy dropping by ~28%; 3) the accuracy-verbosity relationship is non-monotonic, with extended reasoning budgets yielding diminishing returns (GPT-5/o-series models show zero accuracy gain from low -> medium -> high reasoning effort). Our findings challenge the assumption that longer reasoning in LLMs necessarily improves mathematical reasoning.
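To make the idea of a harmonic-mean metric concrete, the following is a minimal sketch and not the paper's exact definition. The abstract states only that the Overthinking Score combines accuracy and token efficiency through a harmonic mean; the efficiency normalization used here (one minus the fraction of a token budget consumed) and the names token_efficiency, harmonic_mean_score, and token_budget are assumptions made for illustration.

```python
def token_efficiency(tokens_used: float, token_budget: float) -> float:
    """Hypothetical efficiency in [0, 1] (an assumption, not the paper's formula).

    Returns 1.0 when no tokens are used and 0.0 at or above the budget.
    """
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    if tokens_used < 0:
        raise ValueError("tokens_used must be non-negative")
    return max(0.0, 1.0 - tokens_used / token_budget)


def harmonic_mean_score(accuracy: float, efficiency: float) -> float:
    """Harmonic mean of two values in [0, 1].

    The harmonic mean is dominated by the smaller input, so a model
    cannot score well by being accurate but wasteful, or terse but wrong.
    """
    for name, value in (("accuracy", accuracy), ("efficiency", efficiency)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    if accuracy + efficiency == 0.0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2.0 * accuracy * efficiency / (accuracy + efficiency)


if __name__ == "__main__":
    # Two hypothetical models with equal accuracy but different verbosity.
    concise = harmonic_mean_score(0.9, token_efficiency(200, 1000))
    verbose = harmonic_mean_score(0.9, token_efficiency(900, 1000))
    print(f"concise: {concise:.3f}, verbose: {verbose:.3f}")
```

Under these assumed definitions, two models with the same 0.9 accuracy receive very different scores: the concise one scores about 0.85, while the verbose one scores about 0.18, because the harmonic mean penalizes whichever component is weaker.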
