Do LLMs Really Need 10+ Thoughts for "Find the Time 1000 Days Later"? Towards Structural Understanding of LLM Overthinking

Xinliang Frederick Zhang, Anhad Mohananey, Alexandra Chronopoulou, Pinelopi Papalampidi, Somit Gupta, Tsendsuren Munkhdalai, Lu Wang, Shyam Upadhyay

TL;DR

This work introduces TRACE, a fine-grained analyzer that reconstructs LLM reasoning into sub-thoughts and progression graphs to study inner thought processes. It reveals that long-form thinking yields substantial inference-time costs (5–20× slower) on simple tasks with little accuracy gain, and identifies two dominant thought-progress patterns—Explorer and Late Landing—driving overthinking through over-exploration and over-verification. A utility-based redefinition of overthinking, grounded in the convergence point of the thought process, enables principled management and real-time heuristics (e.g., self-looping and backtracking) to reduce computation without sacrificing performance. Collectively, TRACE provides a structural lens on LLM reasoning, offering actionable insights for reducing inefficiency and guiding future research on managing overthinking in large language models.

Abstract

Models employing long chain-of-thought (CoT) reasoning have shown superior performance on complex reasoning tasks. Yet this capability introduces a critical and often overlooked inefficiency, overthinking: models often engage in unnecessarily extensive reasoning even for simple queries, incurring significant computation without accuracy improvements. While prior work has explored solutions to mitigate overthinking, a fundamental gap remains in our understanding of its underlying causes. Most existing analyses are limited to superficial, profiling-based observations and fail to delve into LLMs' inner workings. To bridge this gap, this study introduces TRACE, a systematic, fine-grained analyzer of LLMs' thought process. We first benchmark the overthinking issue, confirming that long-thinking models are five to twenty times slower on simple tasks with no substantial accuracy gains. We then use TRACE to decompose the thought process into minimally complete sub-thoughts. Next, by inferring discourse relationships among sub-thoughts, we construct granular thought progression graphs and subsequently identify common thinking patterns for topically similar queries. Our analysis reveals two major patterns for open-weight thinking models: Explorer and Late Landing. This finding provides evidence that over-verification and over-exploration are the primary drivers of overthinking in LLMs. Grounded in thought structures, we propose a utility-based definition of overthinking, which moves beyond length-based metrics. This revised definition offers a more insightful understanding of LLMs' thought progression, as well as practical guidelines for principled overthinking management.
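To make the idea of a utility-based view concrete, the following is a minimal sketch, not the paper's definition. It assumes that each sub-thought has already been labeled with the candidate answer it states (or `None`), and it treats the convergence point as the earliest sub-thought after which the stated answer never changes. The function names `convergence_index` and `overthinking_ratio`, and the choice to measure wasted effort as a share of tokens, are hypothetical choices made for illustration.

```python
from typing import Optional, Sequence


def convergence_index(answers: Sequence[Optional[str]]) -> Optional[int]:
    """Return the index of the earliest sub-thought after which the stated
    answer never changes (an assumed reading of "convergence point").

    answers[i] is the answer stated in sub-thought i, or None if none is stated.
    """
    stated = [(i, a) for i, a in enumerate(answers) if a is not None]
    if not stated:
        return None
    idx, final = stated[-1]
    # Walk backwards over stated answers while they agree with the final one.
    for i, a in reversed(stated):
        if a != final:
            break
        idx = i
    return idx


def overthinking_ratio(answers: Sequence[Optional[str]],
                       token_counts: Sequence[int]) -> float:
    """Share of tokens spent after the convergence point (assumed metric)."""
    if len(answers) != len(token_counts):
        raise ValueError("answers and token_counts must have equal length")
    idx = convergence_index(answers)
    total = sum(token_counts)
    if idx is None or total == 0:
        return 0.0  # no stated answer: this sketch reports no overthinking
    return sum(token_counts[idx + 1:]) / total


# Toy trace: the model tries "A", switches to "B" at sub-thought 2,
# then keeps re-verifying "B" until the end.
toy_answers = ["A", None, "B", "B", None, "B"]
toy_tokens = [10, 10, 20, 20, 20, 20]
toy_ratio = overthinking_ratio(toy_answers, toy_tokens)
```

In this toy trace the answer stabilizes at sub-thought 2, so the three later sub-thoughts count as post-convergence effort. Unlike a pure length metric, a long trace that converges late would score low under this sketch.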

Paper Structure

This paper contains 37 sections, 41 figures, 10 tables.

Figures (41)

  • Figure 1: Performance and inference-time efficiency trends of Qwen3 models at different scales on simple math reasoning. We find that additional thinking becomes ineffective once the model scale is above the threshold of 4B. Plots for other tasks such as temporal and logical reasoning, as well as knowledge recall, are in the appendix.
  • Figure 2: Overview of our proposed analyzer (TRACE) to study the inner workings of an LLM's thought process. It contains four main stages (detailed in the analysis framework section): Response Sampling, Thought Decomposition & Label Inference, Progression Graph Construction, and Thought Pattern Induction.
  • Figure 3: Individual thought progression graph of Qwen3-235B-A22B model on a sampled date arithmetic (temporal-L3) query. Red bubble denotes the ground-truth answer, while the red dashed circle denotes the final delivered answer.
  • Figure 4: The typical Explorer thought progression pattern (case with 5 distinct answers). The size of the blue nodes indicates the visit frequency, while the size of the red nodes (and associated values) indicates the probability of the ground truth being present at that node. Due to the exploratory behavior, multiple reasoning branches emerge and the correct answer can be discovered at any stage of the thought process. Edge thickness indicates the edge frequency, and the red dashed curve denotes the occurrence of backtracking, where the model abandons its current reasoning path. More plots are in the appendix.
  • Figure 5: The typical Late Landing thought progression pattern (case with 5 distinct answers). The model follows a more linear path, with the probability of the ground-truth answer (indicated by the red node size and value) being highly concentrated at the terminal stage of the thought process. Towards the end, the model engages in over-verification, marked by a thick self-loop, to increase its confidence. More plots are in the appendix.
  • ...and 36 more figures
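
The captions for Figures 4 and 5 describe progression graphs with visit frequencies, edge frequencies, self-loops, and backtracking. The sketch below shows one way such counts could be aggregated from sampled traces. It is not the TRACE implementation. It assumes each trace has already been reduced to an ordered list of answer-node labels, and it uses a simplified, hypothetical rule for backtracking (a move to a different node that was visited earlier in the same trace). The names `progression_graph` and `classify_edges` are made up for illustration.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def progression_graph(
    traces: Sequence[Sequence[str]],
) -> Tuple[Counter, Counter]:
    """Aggregate node visit counts and edge counts over sampled traces."""
    node_visits: Counter = Counter()
    edge_counts: Counter = Counter()
    for trace in traces:
        node_visits.update(trace)
        # Consecutive pairs (src, dst) are the edges of this trace.
        edge_counts.update(zip(trace, trace[1:]))
    return node_visits, edge_counts


def classify_edges(trace: Sequence[str]) -> List[str]:
    """Label each transition as 'self-loop', 'backtrack', or 'forward'.

    Assumed rule: a self-loop repeats the same node (re-verification);
    a backtrack returns to a different, previously visited node.
    """
    labels: List[str] = []
    seen = set(trace[:1])
    for src, dst in zip(trace, trace[1:]):
        if src == dst:
            labels.append("self-loop")
        elif dst in seen:
            labels.append("backtrack")
        else:
            labels.append("forward")
        seen.add(dst)
    return labels


# Two toy traces over hypothetical answer nodes A1..A3.
toy_traces = [["A1", "A2", "A1", "A3", "A3"], ["A1", "A2", "A2"]]
visits, edges = progression_graph(toy_traces)
toy_labels = classify_edges(toy_traces[0])
```

Under these assumptions, a trace dominated by terminal self-loops would resemble the Late Landing description, while one with many forward and backtrack edges would resemble Explorer. The actual pattern induction in the paper may differ.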