Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?

Chenrui Fan, Ming Li, Lichao Sun, Tianyi Zhou

TL;DR

Missing Premise exacerbates Overthinking reveals a systematic failure mode in reasoning LLMs: when premises are missing, reasoning models generate dramatically longer, unproductive lines of thought, while non-reasoning models abstain quickly. The authors formalize MiP, build four MiP data suites, and evaluate diverse models to uncover patterns in token usage, abstention behavior, and candidate explanations. They show that current RL/SFT training pipelines encourage lengthy reasoning and can even propagate this behavior through distillation, challenging the validity of test-time scaling assumptions. The work provides datasets, metrics, and insights aimed at fostering more efficient, critically thinking AI that can recognize ill-posed queries and abstain when needed.

Abstract

We find that the response length of reasoning LLMs, whether trained by reinforcement learning or supervised learning, drastically increases for ill-posed questions with missing premises (MiP), ending up with redundant and ineffective thinking. This newly introduced scenario greatly exacerbates the general overthinking issue, a phenomenon we name MiP-Overthinking. Such failures contradict the "test-time scaling law" but have been widely observed on multiple datasets we curated with MiP, indicating the harm of cheap overthinking and a lack of critical thinking. Surprisingly, LLMs not specifically trained for reasoning perform much better in the MiP scenario, producing much shorter responses that quickly identify ill-posed queries. This implies a critical flaw in the current training recipe for reasoning LLMs, which does not adequately encourage efficient thinking and leads to the abuse of thinking patterns. To further investigate the reasons behind such failures, we conduct fine-grained analyses of the reasoning length, overthinking patterns, and location of critical thinking on different types of LLMs. Moreover, our extended ablation study reveals that overthinking is contagious through the distillation of reasoning models' responses. These results improve the understanding of overthinking and offer novel insights into mitigating the problem.

Paper Structure

This paper contains 25 sections, 3 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Illustration of MiP-Overthinking. When queried with questions that have missing premises, reasoning models produce excessively long responses, and they fail to identify the MiP and abstain from answering. The left shows a query with an undefined variable, while the right compares a well-defined GSM8K question with its MiP variant (with a critical numerical condition removed). Reasoning models' responses to MiP questions are much longer than those for well-defined questions and those generated by non-reasoning models. The left corner of each response reports the response length and thinking time by DeepSeek-R1.
  • Figure 2: Response lengths, accuracy on well-defined questions, and abstain rate of reasoning/non-reasoning models on MiP questions from our MiP-GSM8K dataset. (1) Existing reasoning models generate significantly longer responses for MiP questions than for well-defined questions, while non-reasoning models generate responses of similar lengths for both types of questions, indicating MiP-Overthinking for reasoning models. (2) For both types of questions, reasoning models generate longer responses than non-reasoning models, indicating General Overthinking. (3) Although the longer responses by reasoning models slightly improve the accuracy for well-defined questions, they do not improve the abstain rate for MiP questions, indicating a contradiction with the test-time scaling law.
  • Figure 3: The step-level similarity heatmaps for s1.1 responses to well-defined (left) and MiP (right) questions in the MiP-GSM8K dataset. To avoid differences in matrix size, we only consider responses with more than 50 steps and visualize the average similarity matrix across the first 50 steps. The heatmap for MiP questions has a higher averaged similarity and lower standard variance, also shown in the heatmap, which indicates considerable redundancy in the model's content when responding to MiP questions.
  • Figure 4: An example response from a reasoning model (s1.1-32B) to a MiP question. The response exhibits five distinct thinking patterns, highlighted in different colors: ① Revisit Question (yellow), where the model reexamines the original query; ② Visit Knowledge (red), where the model accesses domain-specific knowledge; ③ Propose Assumption (blue), where the model proposes and investigates various hypotheses; ④ Self Doubt (green), where the model questions its own reasoning and expresses uncertainty; and ⑤ Pause/Check (purple), where the model pauses to review previous steps. These patterns demonstrate the model's complex but potentially inefficient reasoning process when confronted with missing premises.
  • Figure 5: The transition flow between in-process suspicion of MiP and the final successful abstention on different reasoning models. For each Sankey diagram, the left bars represent whether the model suspects the given question is unsolvable during its thinking process, i.e., Suspected or Unsuspected; the right bars represent the final abstention, categorized into Abstain (preferred) or Non-abstain. Most existing reasoning models suspect at some point that the given question might be unsolvable, but only in a very small portion of cases do they stand by that suspicion and abstain.
  • ...and 8 more figures
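
The step-level similarity analysis described for Figure 3 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the bag-of-words cosine similarity is an assumed stand-in for whatever step representation the paper uses, and the function names (step_vector, similarity_matrix, average_matrices, off_diagonal_mean) are hypothetical. It only shows the general shape of the computation: build a pairwise similarity matrix over the first 50 steps of each response, then average the matrices across responses.

```python
import math
from collections import Counter


def step_vector(step):
    """Bag-of-words term counts for one reasoning step.

    Assumption: a simple stand-in for a real sentence embedding.
    """
    return Counter(step.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_matrix(steps, max_steps=50):
    """Pairwise similarity over the first max_steps steps of one response."""
    vecs = [step_vector(s) for s in steps[:max_steps]]
    return [[cosine(u, v) for v in vecs] for u in vecs]


def average_matrices(matrices):
    """Element-wise mean of equally sized square matrices."""
    n = len(matrices)
    size = len(matrices[0])
    return [
        [sum(m[i][j] for m in matrices) / n for j in range(size)]
        for i in range(size)
    ]


def off_diagonal_mean(matrix):
    """Mean similarity between distinct steps (the diagonal is excluded)."""
    size = len(matrix)
    vals = [matrix[i][j] for i in range(size) for j in range(size) if i != j]
    return sum(vals) / len(vals) if vals else 0.0
```

Under these assumptions, a higher off-diagonal mean for MiP responses than for well-defined ones would correspond to the redundancy the caption describes. Truncating every response to the same number of steps keeps the matrices the same size so they can be averaged.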

Theorems & Definitions (1)

  • Definition 1: Missing Premise Problem