Revisiting Overthinking in Long Chain-of-Thought from the Perspective of Self-Doubt

Keqin Peng, Liang Ding, Yuanxin Ouyang, Meng Fang, Dacheng Tao

TL;DR

The study analyzes overthinking in Long Chain-of-Thought through a self-doubt lens, revealing self-doubt as a major driver of redundant reasoning in mathematical tasks. It introduces a lightweight prompting method that first assesses input validity and then yields a concise answer, reducing token usage while preserving or improving accuracy. Across three well-defined math benchmarks and four Missing-Premise datasets, the approach achieves about a $1/3$ average reduction in tokens, with larger gains (up to $66\%$) in challenging cases, and substantially improves abstain rates on ill-posed problems. These findings offer a practical path to more efficient, reliable reasoning in RLLMs, while acknowledging limitations related to domain scope and evaluator biases.

Abstract

Reasoning Large Language Models (RLLMs) have demonstrated impressive performance on complex tasks, largely due to the adoption of Long Chain-of-Thought (Long CoT) reasoning. However, they often exhibit overthinking -- performing unnecessary reasoning steps even after arriving at the correct answer. Prior work has largely focused on qualitative analyses of overthinking through sample-based observations of long CoTs. In contrast, we present a quantitative analysis of overthinking from the perspective of self-doubt, characterized by excessive token usage devoted to re-verifying already-correct answers. We find that self-doubt significantly contributes to overthinking. In response, we introduce a simple and effective prompting method to reduce the model's over-reliance on input questions, thereby avoiding self-doubt. Specifically, we first prompt the model to question the validity of the input question, and then to respond concisely based on the outcome of that evaluation. Experiments on three mathematical reasoning tasks and four datasets with missing premises demonstrate that our method substantially reduces answer length and yields significant improvements across nearly all datasets on four widely used RLLMs. Further analysis demonstrates that our method effectively minimizes the number of reasoning steps and reduces self-doubt.
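
The two-step prompting idea described in the abstract (question the validity of the input, then answer concisely) might be sketched as follows. This is a minimal illustration, not the authors' prompt: the function name `build_validity_first_prompt` and the instruction wording are assumptions made for this example, and the paper's actual prompt text is not reproduced here.

```python
def build_validity_first_prompt(question: str) -> str:
    """Wrap a question in a validity-check-then-answer instruction.

    Hypothetical helper: the instruction wording below is an assumption,
    not the prompt used in the paper.
    """
    instruction = (
        "First, check whether the question below is well-posed, that is, "
        "whether it states every premise needed to solve it.\n"
        "If a premise is missing or the question is otherwise invalid, "
        "say so briefly and do not attempt a solution.\n"
        "If the question is valid, answer it concisely."
    )
    # Strip stray whitespace so the question sits cleanly after the label.
    return f"{instruction}\n\nQuestion: {question.strip()}"


if __name__ == "__main__":
    # Example with a missing premise: the price per apple is never given.
    print(build_validity_first_prompt("Tom buys 3 apples. How much does he pay?"))
```

The resulting string would be sent to the model in place of the bare question. Putting the validity check before the answering instruction mirrors the order given in the abstract.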

Paper Structure

This paper contains 20 sections, 2 figures, 6 tables.

Figures (2)

  • Figure 1: The proportion of overthinking and self-doubt across three different datasets with DeepSeek-R1-Distill-Qwen-32B. SD: Self-Doubt, OT w/o SD: Overthinking without Self-Doubt, NOT: Non-Overthinking.
  • Figure 2: The average thinking steps across three mathematical reasoning tasks with DeepSeek-R1-Distill-Qwen-32B.