The Evolution of Thought: Tracking LLM Overthinking via Reasoning Dynamics Analysis

Zihao Wei; Liang Pang; Jiahao Liu; Wenjie Shi; Jingcheng Deng; Shicheng Xu; Zenghao Duan; Fei Sun; Huawei Shen; Xueqi Cheng

TL;DR

This work identifies two intertwined Reasoning Dynamics in LLMs: Thinking-Content Compensation (length-based trade-offs) and Semantic Path Convergence (latent-space stabilization). It defines the instance-specific Reasoning Completion Point (RCP), the point after which further thinking becomes redundant, and presents RCPD, an inference-time, rank-based detector that terminates thinking at the RCP to curb overthinking. On the AIME and GPQA benchmarks with Qwen3 and DeepSeek-R1, RCPD reduces token usage by up to 44% with minimal accuracy loss, demonstrating a principled, training-free approach to efficient test-time scaling. The findings highlight the importance of adaptive early stopping in reasoning and offer a practical tool for the robust, efficient deployment of LLM reasoning systems.
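The summary does not say how Semantic Path Convergence is measured. As a rough illustration only, assume each reasoning step is summarized by a hidden-state vector; latent-space stabilization can then be flagged when consecutive step vectors remain highly similar. The cosine-similarity criterion, the `threshold` and `window` values, and the function names below are our assumptions, not the paper's method:

```python
import math
from typing import List, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def convergence_step(
    states: List[Sequence[float]], threshold: float = 0.95, window: int = 3
) -> Optional[int]:
    """Hypothetical convergence test: return the index of the first state that
    starts a run of `window` consecutive step-to-step similarities at or above
    `threshold`, or None if the reasoning path never stabilizes."""
    run = 0
    for i in range(1, len(states)):
        if cosine(states[i - 1], states[i]) >= threshold:
            run += 1
            if run == window:
                # The stable run covers transitions i-window+1 .. i,
                # so it begins at state i - window.
                return i - window
        else:
            run = 0
    return None


# Toy 2-D "hidden states": two exploratory jumps, then the path settles.
states = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.91, 0.09], [0.92, 0.08], [0.92, 0.08]]
print(convergence_step(states))  # prints 2
```

A fixed similarity threshold is the simplest possible choice; a detector that must tolerate repetitive oscillations between a few states would need a looser, windowed criterion.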

Abstract

Test-time scaling via explicit reasoning trajectories significantly boosts large language model (LLM) performance but often triggers overthinking. To understand this behavior, we analyze reasoning through two lenses: Reasoning Length Dynamics, which reveals a compensatory trade-off between thinking and answer content length that eventually leads to thinking redundancy, and Reasoning Semantic Dynamics, which identifies semantic convergence and repetitive oscillations. These dynamics uncover an instance-specific Reasoning Completion Point (RCP), beyond which computation continues without further performance gain. Since the RCP varies across instances, we propose a Reasoning Completion Point Detector (RCPD), an inference-time early-exit method that identifies the RCP by monitoring the rank dynamics of termination tokens (e.g., </think>). Across AIME and GPQA benchmarks using Qwen3 and DeepSeek-R1, RCPD reduces token usage by up to 44% while preserving accuracy, offering a principled approach to efficient test-time scaling.
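RCPD's exact decision rule is not spelled out in this summary. The sketch below illustrates the general idea of monitoring the rank of a termination token such as `</think>` during decoding, under an assumed rule: stop thinking once the token stays within the top `top_k` candidates for `patience` consecutive steps. `RankEarlyExit`, `token_rank`, and both parameters are hypothetical names, and the toy logits stand in for a real model's next-token scores:

```python
from typing import List, Optional, Sequence


def token_rank(logits: Sequence[float], token_id: int) -> int:
    """1-based rank of token_id among next-token scores (1 = most likely).
    Only strictly larger scores count, so ties favor the token."""
    target = logits[token_id]
    return 1 + sum(1 for x in logits if x > target)


class RankEarlyExit:
    """Hypothetical rank-based stopping rule (not the paper's exact RCPD):
    signal an exit once the end-of-thinking token ranks within the top
    `top_k` for `patience` consecutive decoding steps."""

    def __init__(self, end_think_id: int, top_k: int = 5, patience: int = 3):
        self.end_think_id = end_think_id
        self.top_k = top_k
        self.patience = patience
        self.streak = 0
        self.history: List[int] = []  # rank of the termination token per step

    def step(self, logits: Sequence[float]) -> bool:
        """Record this step's rank; return True if thinking should stop now."""
        r = token_rank(logits, self.end_think_id)
        self.history.append(r)
        self.streak = self.streak + 1 if r <= self.top_k else 0
        return self.streak >= self.patience


# Toy vocabulary of 6 tokens; id 5 plays the role of </think>.
toy_logits = [
    [5.0, 4.0, 3.0, 2.0, 1.0, 0.0],  # termination token ranked last (6)
    [5.0, 4.0, 3.0, 2.0, 1.0, 4.5],  # rank 2
    [5.0, 4.0, 3.0, 2.0, 1.0, 6.0],  # rank 1
]
detector = RankEarlyExit(end_think_id=5, top_k=2, patience=2)
exit_step: Optional[int] = None
for t, logits in enumerate(toy_logits):
    if detector.step(logits):
        exit_step = t
        break
print(exit_step, detector.history)  # prints: 2 [6, 2, 1]
```

In an actual decoder the exit would be enacted by appending the end-of-thinking token and switching to answer generation; the paper's rule over rank dynamics may well differ from this fixed top-k streak.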
