The Price of a Second Thought: On the Evaluation of Reasoning Efficiency in Large Language Models

Siqi Fan, Bowen Qin, Peng Han, Shuo Shang, Yequan Wang, Aixin Sun

TL;DR

The paper studies overthinking in large language models during mathematical reasoning and argues that token efficiency alone is an insufficient measure because it ignores problem difficulty and intermediate computation. It formalizes reasoning efficiency as a relative metric that compares thinking models against an instruct-model baseline, and introduces CoThink, a two-stage pipeline in which an instruct model drafts a concise outline and a thinking model expands and verifies it. Across GSM8K, MATH500, and AIME24, with four thinking models, CoThink reduces token use by about $21.1\%$ while preserving accuracy, with stronger gains on harder problems. It further analyzes a hypothesized scaling law $Q(C) \propto C^\beta$ and identifies algorithmic and data-distribution sources of inefficiency, offering a practical framework for deploying efficient reasoning without sacrificing correctness.
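
As a worked reading of the hypothesized scaling law, assume that $Q$ denotes a quality measure such as accuracy and $C$ the computation spent; these meanings are an assumption, since the summary does not define the symbols. The proportionality $Q(C) \propto C^\beta$ then fixes how much quality changes when computation doubles. The exponents $0.3$ and $0.5$ are the reference values given in the Figure 2 caption:

$$
\frac{Q(2C)}{Q(C)} = 2^{\beta}, \qquad 2^{0.3} \approx 1.23, \qquad 2^{0.5} \approx 1.41
$$

Under this reading, any $\beta < 1$ means that doubling computation yields less than double the quality, which is consistent with the paper's concern that extra tokens bring diminishing returns.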

Abstract

Recent thinking models trained with reinforcement learning and backward-checking CoT often suffer from overthinking: they produce excessively long outputs even on simple problems, wasting computation. Existing evaluations, based on token efficiency, give an incomplete view as they neglect problem difficulty and intermediate computation costs. We formalize reasoning efficiency as a relative measure between thinking and instruct models, treating instruct models as the minimal-effort baseline. A systematic study across four thinking models and multiple benchmarks reveals two consistent patterns: (i) instruct models achieve higher efficiency overall, and (ii) problem difficulty affects efficiency, with thinking models wasting computation on easy problems but providing value on harder ones. Building on this insight, we propose CoThink, a simple two-stage pipeline: an instruct model drafts a brief outline, and a thinking model expands it. On GSM8K, MATH500, and AIME24, CoThink cuts token usage by 21.1% while keeping accuracy on four thinking models, and remains competitive with strong efficiency baselines.
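
The two-stage pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Model` callable interface, the prompt wording, the whitespace-based `count_tokens`, and the helper `token_reduction` are all hypothetical stand-ins. Counting the tokens of both stages is also an assumption, made so that the intermediate outline is not ignored.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: a "model" is any callable mapping a prompt to text.
Model = Callable[[str], str]


@dataclass
class CoThinkResult:
    outline: str       # stage 1 output from the instruct model
    answer: str        # stage 2 output from the thinking model
    total_tokens: int  # tokens generated across both stages


def count_tokens(text: str) -> int:
    # Crude whitespace split; a real tokenizer would be used in practice (assumption).
    return len(text.split())


def cothink(question: str, instruct_model: Model, thinking_model: Model) -> CoThinkResult:
    # Stage 1: the instruct model drafts a brief outline (prompt wording is illustrative).
    outline = instruct_model(
        f"Question: {question}\n\nWrite a brief solution outline."
    )
    # Stage 2: the thinking model expands the outline into a full solution.
    answer = thinking_model(
        f"Question: {question}\n\nOutline:\n{outline}\n\n"
        "Expand the outline, check each step, and state the final answer."
    )
    total = count_tokens(outline) + count_tokens(answer)
    return CoThinkResult(outline=outline, answer=answer, total_tokens=total)


def token_reduction(baseline_tokens: int, new_tokens: int) -> float:
    # Fraction of tokens saved relative to the thinking model run on its own.
    return (baseline_tokens - new_tokens) / baseline_tokens
```

With stub callables in place of real models, `cothink("q", instruct, thinking)` returns the outline, the expanded answer, and the combined token count. Under this definition, a run that uses 789 tokens against a 1000-token solo baseline has `token_reduction(1000, 789)` of about 0.211, the same scale as the 21.1% figure reported in the abstract.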

Paper Structure

This paper contains 25 sections, 3 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Illustration of token lengths for example questions from AIME 2024, where all models successfully answer all these questions: (a) shows answers by Qwen2.5-32B-Instruct (Instruct LLM) and DeepSeek-R1-distill-Qwen-32B (Thinking LLM) on Q67, (b) plots the total number of tokens in their solutions for 5 questions. Note: Question ID follows the Qwen2.5-Math evaluation format qwen2.5, ranging from Q60 to Q89.
  • Figure 2: Reasoning efficiency comparison across different models. Each model is represented by a specific marker shape, and each dataset by a distinct color. The dashed gray lines correspond to the hypothesized efficiency scaling law with assumed scaling exponents $\beta=0.3$ and $\beta=0.5$, shown for reference.
  • Figure 3: We present five AIME24 questions that the instruct model (Qwen2.5-32B-Instruct) fails to answer on its own. For each question, we prepend thinking episodes generated by the DeepSeek-R1-Distill-Qwen-32B model as context, and test whether this helps the instruct model arrive at the correct answer.
  • Figure 4: An illustration of the CoThink two-stage framework compared with its SoloThink counterparts using either an instruct model or a thinking model.
  • Figure 5: Reasoning efficiency comparison between Solo-Thinking and CoThink.
  • ...and 2 more figures