Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models

Gaurav Srivastava, Aafiya Hussain, Sriram Srinivasan, Xuan Wang

TL;DR

Strong performance by large language models on complex benchmarks does not translate into reliable basic math reasoning, especially under constrained reasoning budgets. This work introduces the Overthinking Score, a principled harmonic-mean metric of accuracy and token efficiency, together with a dynamic test-generation protocol covering 14 basic math tasks, and uses both to study accuracy-verbosity tradeoffs across 53 LLMs. Key findings show non-monotonic scaling, substantial token waste from extended reasoning, and sharp accuracy drops under token constraints, challenging the assumption that more reasoning improves math performance. The results advocate adaptive stopping, step-level verification, and efficiency-aware deployment to achieve reliable mathematical reasoning in LLMs.

Abstract

Large language models (LLMs) achieve impressive performance on complex mathematical benchmarks yet sometimes fail on basic math reasoning while generating unnecessarily verbose responses. In this paper, we present a systematic benchmark and comprehensive empirical study to evaluate the efficiency of reasoning in LLMs, focusing on the fundamental tradeoff between accuracy and overthinking. First, we formalize the accuracy-verbosity tradeoff. Second, we introduce the Overthinking Score, a harmonic-mean metric combining accuracy and token efficiency for holistic model evaluation. Third, we establish an evaluation protocol with dynamically generated data across 14 basic math tasks. Fourth, we conduct a large-scale empirical study evaluating 53 LLMs, including reasoning and quantized variants across different reasoning budgets. Our findings reveal: 1) model performance on complex benchmarks does not translate directly to basic math reasoning; 2) reasoning models generate ~18x more tokens while sometimes achieving lower accuracy, and exhibit catastrophic collapse when the token budget is constrained, dropping by ~28%; 3) the accuracy-verbosity relationship is non-monotonic, with extended reasoning budgets yielding diminishing returns (GPT-5/o-series models show zero accuracy gain from low -> medium -> high reasoning effort). Our findings challenge the assumption that longer reasoning in LLMs necessarily improves mathematical reasoning.
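To make the harmonic-mean idea concrete, here is a minimal Python sketch. It assumes accuracy and token efficiency are both already expressed as values in [0, 1]. The abstract does not give the paper's definition of token efficiency, so `toy_token_efficiency` and its `reference_tokens` parameter are hypothetical stand-ins for illustration, not the authors' formula.

```python
def toy_token_efficiency(tokens_used: int, reference_tokens: int) -> float:
    """Hypothetical efficiency in (0, 1]: 1.0 at or below the reference length.

    This normalization is an assumption for illustration only; the paper's
    actual definition of token efficiency may differ.
    """
    if tokens_used <= 0 or reference_tokens <= 0:
        raise ValueError("token counts must be positive")
    return min(1.0, reference_tokens / tokens_used)


def harmonic_mean_score(accuracy: float, efficiency: float) -> float:
    """Harmonic mean of two scores in [0, 1]; returns 0.0 if either is 0."""
    for name, value in (("accuracy", accuracy), ("efficiency", efficiency)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    if accuracy == 0.0 or efficiency == 0.0:
        return 0.0
    return 2.0 * accuracy * efficiency / (accuracy + efficiency)


if __name__ == "__main__":
    # A verbose but accurate model versus a terser, slightly less accurate one.
    verbose = harmonic_mean_score(0.9, toy_token_efficiency(1800, 100))
    concise = harmonic_mean_score(0.8, toy_token_efficiency(125, 100))
    print(f"verbose model: {verbose:.3f}")
    print(f"concise model: {concise:.3f}")
```

Because a harmonic mean is pulled toward the smaller of its two inputs, a model cannot offset very low token efficiency with high accuracy alone, which is the property that makes this kind of metric penalize overthinking.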

Paper Structure

This paper contains 124 sections, 24 theorems, 49 equations, 4 figures, 11 tables, 6 algorithms.

Key Result

Theorem 1

The Overthinking Score $\mathcal{O}_i$ satisfies:

Figures (4)

  • Figure 1: Three dimensions of our evaluation. (a) Scaling performance across different families: Llama shows large jumps between small and large models. Qwen2.5 shows generally monotonic but sublinear scaling. Qwen3 exhibits non-monotonic behavior with 14B outperforming 32B. (b) Token budget constraints: Qwen3 reasoning models' accuracy under full budget vs. 1024-token limit reveals catastrophic degradation. (c) Quantization robustness: Qwen2.5 family across FP16, 8-bit, and 4-bit precision shows size-dependent tolerance to compression.
  • Figure 2: Benchmark paradox: Qwen2.5 family performance on GSM8K, GSM-Plus, and basic math reasoning shows significant discrepancies across models.
  • Figure 3: Reasoning Pattern Case Studies: (a) Extreme overthinking with 31x token waste through 3x verification. (b) Pathological failure showing stopping mechanism breakdown ($\infty$ character repetition). (c) Helpful reasoning where 3.5x verbosity provides verifiable steps. (d) Self-contradiction loops. The complete case studies are in the appendix.
  • Figure 4: Reasoning budget analysis across Gemini, GPT-5, and O-series models. Increasing the reasoning budget yields minimal gains, showing diminishing returns and near-plateau performance at higher effort levels.

Theorems & Definitions (67)

  • Definition 1: Deterministic Arithmetic Task
  • Definition 2: Overthinking
  • Definition 3: Token Efficiency
  • Definition 4: Overthinking Score
  • Theorem 1: Complete Properties of Overthinking Score
  • proof
  • Theorem 2: Sensitivity Properties
  • proof
  • Corollary 1: Improvement Incentives
  • Theorem 3: Concavity Properties
  • ...and 57 more