Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens

Wei-Lin Chen, Liqian Peng, Tian Tan, Chao Zhao, Blake JianHang Chen, Ziqian Lin, Alec Go, Yu Meng

TL;DR

This paper tackles the problem that longer chain-of-thought (CoT) outputs do not reliably improve correctness and can waste compute. It introduces the deep-thinking ratio (DTR), which quantifies inference-time thinking by tracking when token predictions stabilize across model layers: each intermediate-layer distribution is compared with the final-layer distribution using Jensen-Shannon divergence, and a token counts as settled according to a threshold $g$ and a depth fraction $\rho$. Empirically, DTR exhibits a strong, consistent positive correlation with accuracy across multiple math and science benchmarks and model families, outperforming token-length and confidence-based baselines. Building on this, Think@n uses DTR to selectively aggregate high-quality samples, achieving comparable or better accuracy than standard self-consistency while reducing inference costs by about half, and enabling effective early stopping based on short prefixes. The results highlight the value of measuring internal computation depth rather than surface length, offering a principled path to more reliable and efficient reasoning in large language models.

Abstract

Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts are unreliable proxies for reasoning quality: increased generation length does not consistently correlate with accuracy and may instead signal "overthinking," leading to performance degradation. In this work, we quantify inference-time effort by identifying deep-thinking tokens -- tokens where internal predictions undergo significant revisions in deeper model layers prior to convergence. Across four challenging mathematical and scientific benchmarks (AIME 24/25, HMMT 25, and GPQA-diamond) and a diverse set of reasoning-focused models (GPT-OSS, DeepSeek-R1, and Qwen3), we show that deep-thinking ratio (the proportion of deep-thinking tokens in a generated sequence) exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines. Leveraging this insight, we introduce Think@n, a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. We demonstrate that Think@n matches or exceeds standard self-consistency performance while significantly reducing inference costs by enabling the early rejection of unpromising generations based on short prefixes.
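
The selection step behind Think@n can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `think_at_n`, the `keep_fraction` parameter, the `complete` callback, and all numbers are hypothetical toy values. It ranks samples by a DTR estimated from a short prefix, finishes generation only for the top-ranked samples, and takes a majority vote over their answers.

```python
import math
from collections import Counter

def think_at_n(prefix_dtr, complete, keep_fraction=0.5):
    """Keep the samples with the highest prefix DTR and majority-vote their answers.

    prefix_dtr:    dict mapping sample id -> DTR measured on a short prefix
    complete:      callable(sample id) -> final answer; called only for kept samples
    keep_fraction: hypothetical knob, the share of samples that survive
    """
    if not prefix_dtr:
        raise ValueError("need at least one sample")
    k = max(1, math.ceil(keep_fraction * len(prefix_dtr) - 1e-9))
    ranked = sorted(prefix_dtr, key=prefix_dtr.get, reverse=True)
    kept = ranked[:k]
    answers = [complete(sample_id) for sample_id in kept]
    return Counter(answers).most_common(1)[0][0], kept

# Toy data (made up for illustration, not taken from the paper).
toy_dtr = {"s0": 0.31, "s1": 0.12, "s2": 0.27, "s3": 0.09, "s4": 0.22, "s5": 0.15}
toy_answers = {"s0": "13", "s1": "7", "s2": "13", "s3": "7", "s4": "5", "s5": "7"}

completed = []  # records which samples were actually generated to the end

def finish(sample_id):
    completed.append(sample_id)
    return toy_answers[sample_id]

answer, kept = think_at_n(toy_dtr, finish, keep_fraction=0.5)
print(answer, kept, len(completed))
```

In this toy setup a plain majority vote over all six samples would pick "7", whereas the DTR-filtered vote picks "13" and completes only three of the six generations, which is where the cost saving would come from.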

Paper Structure (33 sections, 21 equations, 8 figures, 8 tables, 1 algorithm)

Figures (8)

  • Figure 1: Comparison of correlations between accuracy and proxies for thinking effort. The plots illustrate the relationship between model performance and two inference-time measures of thinking effort on GPT-OSS-120B-medium across AIME 2024/2025, HMMT 2025, and GPQA-Diamond. (Left) Output token count exhibits a moderate negative correlation (average $r = -0.544$), suggesting that output length is an unreliable indicator of performance. (Right) In contrast, our proposed deep-thinking ratio demonstrates a strong positive correlation with accuracy (average $r = 0.828$).
  • Figure 2: Heatmap of thought: We plot the Jensen–Shannon divergence (JSD) values between the distributions of the last (36th) layer and intermediate layers for an answer sequence from GPT-OSS-120B-high. Functional and templated words (e.g., "and", "is", "boxed", "<|return|>") often converge at relatively shallow layers; completions after operators (e.g., "+", "=") and answer tokens/symbols (e.g., "13", "(D)") do not settle until deeper layers. Interestingly, the answer token "13" gradually surfaces in earlier layers after its first appearance.
  • Figure 3: Illustration of our method for identifying deep-thinking tokens. For a model with 10 layers and depth fraction $\rho=0.8$, the token at generation step $t$ is classified as a deep-thinking token because its JSD with the final-layer distribution first falls below the threshold $g$ only once it reaches the late-settling regime.
  • Figure 4: Effect of hyper-parameters on thinking effort measurement and accuracy profiles. We analyze the impact of hyper-parameters by sweeping the settling threshold $g$ and the depth fraction $\rho$. (a) Varying $g$ has a larger impact on the correlation; a permissive threshold ($g=0.25$) yields flatter trends, whereas $g=0.5$ provides the most robust positive signal. (b) Varying $\rho$ shifts the range of thinking effort scores but maintains consistently positive slopes. Overall, stricter criteria (higher $g$, lower $\rho$) reduce the range of DTR, with $(g, \rho) = (0.5, 0.85)$ offering an ideal balance between stability and correlation.
  • Figure 5: Comparison of the trade-off between task accuracy and inference cost (tokens) with different aggregation methods. Accuracy is averaged across all four datasets (AIME 24/25, HMMT 25, GPQA-D). Our Think@$n$ method achieves the best overall Pareto-optimal performance. It matches or exceeds the accuracy of Cons@$n$ with approximately half the inference cost, while Self-Certainty@$n$ is notably less efficient.
  • ...and 3 more figures
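
The settling rule illustrated in Figure 3 can be sketched in a few lines. This is a hedged sketch rather than the paper's code, and it rests on several assumptions: JSD is computed in base 2, a token "settles" at the first layer whose JSD to the final-layer distribution drops below $g$, and the late-settling regime starts at layer $\lceil \rho L \rceil$ (1-indexed). The function names and the two-token toy distributions are hypothetical.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence in base 2, so the value lies in [0, 1]."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def settling_layer(layer_dists, g):
    """1-indexed first layer whose JSD to the final-layer distribution is below g."""
    final = layer_dists[-1]
    for depth, dist in enumerate(layer_dists, start=1):
        if jsd(dist, final) < g:
            return depth
    return len(layer_dists)  # only reached if g <= 0; the final layer has JSD 0

def is_deep_thinking(layer_dists, g=0.5, rho=0.85):
    """Assumed rule: the token settles only inside the last (1 - rho) share of layers."""
    late_start = math.ceil(rho * len(layer_dists) - 1e-9)
    return settling_layer(layer_dists, g) >= late_start

def deep_thinking_ratio(tokens, g=0.5, rho=0.85):
    """Share of tokens in a sequence that are classified as deep-thinking."""
    flags = [is_deep_thinking(t, g, rho) for t in tokens]
    return sum(flags) / len(flags)

# Toy 10-layer model over a 2-word vocabulary (made-up numbers).
shallow = [[0.5, 0.5]] + [[0.9, 0.1]] * 9        # close to the final prediction from layer 1
deep = [[0.02, 0.98]] * 8 + [[0.98, 0.02]] * 2   # prediction flips only at layer 9

sequence = [shallow, shallow, deep, shallow]
print(settling_layer(shallow, 0.5), settling_layer(deep, 0.5))
print(deep_thinking_ratio(sequence, g=0.5, rho=0.8))
```

With $\rho = 0.8$ and 10 layers the late-settling regime begins at layer 8 under this convention, so the token that settles at layer 9 is counted as deep-thinking and the toy sequence has a DTR of 1/4.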