OptimalThinkingBench: Evaluating Over and Underthinking in LLMs

Pranjal Aggarwal, Seungone Kim, Jack Lanchantin, Sean Welleck, Jason Weston, Ilia Kulikov, Swarnadeep Saha

TL;DR

This work tackles the challenge of balancing accuracy and efficiency in LLM reasoning by introducing OptimalThinkingBench, a unified benchmark that jointly evaluates overthinking and underthinking. It defines two sub-benchmarks, OverthinkingBench and UnderthinkingBench, and introduces metrics including Overthinking-Adjusted Accuracy ($\text{AUC}_\text{OAA}$) and the combined $F_1^{\text{otb}}$. Evaluating 33 models, it shows that no model achieves an optimal balance: thinking models waste tokens on simple tasks, while non-thinking models underperform on hard reasoning. The work also assesses multiple training-time and inference-time approaches to promote optimal thinking, finding trade-offs across sub-benchmarks and highlighting the need for better unified models.

Abstract

Thinking LLMs solve complex tasks at the expense of increased compute and overthinking on simpler problems, while non-thinking LLMs are faster and cheaper but underthink on harder reasoning problems. This has led to the development of separate thinking and non-thinking LLM variants, leaving the onus of selecting the optimal model for each query on the end user. We introduce OptimalThinkingBench, a unified benchmark that jointly evaluates overthinking and underthinking in LLMs and also encourages the development of optimally-thinking models that balance performance and efficiency. Our benchmark comprises two sub-benchmarks: OverthinkingBench, featuring simple math and general queries in 72 domains, and UnderthinkingBench, containing 11 challenging reasoning tasks along with harder math problems. Using novel thinking-adjusted accuracy metrics, we extensively evaluate 33 different thinking and non-thinking models and show that no model is able to optimally think on our benchmark. Thinking models often overthink for hundreds of tokens on the simplest user queries without improving performance. In contrast, large non-thinking models underthink, often falling short of much smaller thinking models. We further explore several methods to encourage optimal thinking, but find that these approaches often improve on one sub-benchmark at the expense of the other, highlighting the need for better unified and optimal models in the future.

Paper Structure

This paper contains 27 sections, 3 equations, 11 figures, 10 tables.

Figures (11)

  • Figure 1: OptimalThinkingBench: A unified benchmark to evaluate overthinking and underthinking in LLMs. OverthinkingBench consists of simpler queries where excessive thinking either does not improve or occasionally degrades performance. UnderthinkingBench consists of reasoning problems where lack of thinking hurts performance.
  • Figure 2: Visualization of the $\text{AUC}_\text{OAA}$ metric, showing Overthinking-Adjusted Accuracy ($\text{OAA}_t$) versus the thinking token threshold $t$. We illustrate with three model types: a Non-thinking model (red) achieves constant 70% accuracy from $t=0$; an Overthinking model (orange) overthinks even on simple problems, decreasing $\text{AUC}_\text{OAA}$; and an Optimal Thinking model (blue) thinks fast on simple problems while spending more compute on harder problems, achieving better $\text{AUC}_\text{OAA}$. Shaded areas represent $\text{AUC}_\text{OAA}$ values. The ranking: $\text{AUC}_\text{OAA}^\text{optimal} > \text{AUC}_\text{OAA}^{\text{non-think}} > \text{AUC}_\text{OAA}^{\text{overthink}}$.
  • Figure 3: Comparison of overthinking metrics on OvT-Math and OvT-General. Math questions invoke greater overthinking than general-domain ones.
  • Figure 4: Generation recipe of OverthinkingBench (Step 1 and 2) and evaluation recipe of models on OverthinkingBench (Step 3). We follow a generation and filtering pipeline to generate and verify the questions and answer correctness. We evaluate model outputs on this benchmark based on the number of tokens used (overthinking) and answer correctness, using an LLM-as-a-Judge verifier.
  • Figure 5: Generation recipe of UnderthinkingBench (Step 1) and evaluation recipe of models on UnderthinkingBench (Step 2). We follow a generation and filtering pipeline to first generate and then check for reasoning tasks that particularly benefit from thinking (by leveraging the difference between a small thinking model and a large non-thinking model). We evaluate models on UnderthinkingBench using accuracy computed with a code-based verifier.
  • ...and 6 more figures
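
As a rough illustration of the metric described in the Figure 2 caption, the sketch below assumes that $\text{OAA}_t$ is the fraction of queries answered correctly using at most $t$ thinking tokens, and that $\text{AUC}_\text{OAA}$ is the average of $\text{OAA}_t$ over integer thresholds from 0 up to a cap `t_max`. These definitions, the function names `oaa_at` and `auc_oaa`, the cap of 1000 tokens, and the toy data are all assumptions made for illustration; this is not the paper's reference implementation.

```python
from typing import Sequence


def oaa_at(correct: Sequence[bool], think_tokens: Sequence[int], t: int) -> float:
    """Assumed OAA_t: share of samples that are correct AND used <= t thinking tokens."""
    if len(correct) != len(think_tokens):
        raise ValueError("correct and think_tokens must have the same length")
    if not correct:
        raise ValueError("need at least one sample")
    hits = sum(1 for ok, used in zip(correct, think_tokens) if ok and used <= t)
    return hits / len(correct)


def auc_oaa(correct: Sequence[bool], think_tokens: Sequence[int], t_max: int) -> float:
    """Assumed AUC_OAA: mean of OAA_t over integer thresholds t = 0..t_max."""
    if t_max < 0:
        raise ValueError("t_max must be non-negative")
    total = sum(oaa_at(correct, think_tokens, t) for t in range(t_max + 1))
    return total / (t_max + 1)


# Toy data (hypothetical): 10 queries per model, threshold cap of 1000 tokens.
T_MAX = 1000

# Non-thinking: 70% correct, zero thinking tokens on every query.
non_think = auc_oaa([True] * 7 + [False] * 3, [0] * 10, T_MAX)

# Overthinking: same 70% accuracy, but 500 thinking tokens on every query.
overthink = auc_oaa([True] * 7 + [False] * 3, [500] * 10, T_MAX)

# Optimal thinking: 90% correct with only 100 thinking tokens per query.
optimal = auc_oaa([True] * 9 + [False] * 1, [100] * 10, T_MAX)

print(f"optimal={optimal:.3f} non_think={non_think:.3f} overthink={overthink:.3f}")
```

Under these assumptions the toy models reproduce the ordering stated in the caption: the model that is accurate with few thinking tokens scores highest, and the model that spends many tokens for no accuracy gain scores lowest, because its correct answers only start counting once the threshold reaches its token usage.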