Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens
Wei-Lin Chen, Liqian Peng, Tian Tan, Chao Zhao, Blake JianHang Chen, Ziqian Lin, Alec Go, Yu Meng
TL;DR
This paper addresses the problem that longer chain-of-thought (CoT) traces do not reliably improve correctness and can waste compute. It introduces the deep-thinking ratio (DTR), which quantifies inference-time thinking by tracking the depth at which each token's prediction stabilizes across model layers. Stabilization is measured with the Jensen-Shannon divergence between intermediate-layer distributions, and the settling criteria are defined by a threshold g and a depth fraction $\frac{L}{\rho}$. Empirically, DTR shows a strong, consistent positive correlation with accuracy across multiple math and science benchmarks and model families, outperforming token-length and confidence-based baselines. Building on this, Think@n uses DTR to selectively aggregate high-quality samples, achieving comparable or better accuracy than standard self-consistency while reducing inference costs by about half and enabling effective early stopping based on short prefixes. The results highlight the value of measuring internal computation depth rather than surface length, offering a principled path to more reliable and efficient reasoning in large language models.
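To make the idea of a settling depth concrete, here is a minimal sketch of how a DTR-style score could be computed from per-layer next-token distributions. It is an illustration under stated assumptions, not the paper's implementation: the function names, the choice to compare each layer against the final layer, the "stays below g from this layer on" settling rule, and the `min_depth` cutoff (passed in directly rather than derived from the depth fraction) are all hypothetical.

```python
import numpy as np


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) along the last axis."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log(p / m), axis=-1)
    kl_qm = np.sum(q * np.log(q / m), axis=-1)
    return 0.5 * (kl_pm + kl_qm)


def settling_depths(layer_probs, g):
    """Return a 1-indexed settling depth per token.

    layer_probs has shape [tokens, layers, vocab]. Assumption: a token has
    settled at the first layer from which the divergence to the final-layer
    distribution stays at or below the threshold g.
    """
    final = layer_probs[:, -1:, :]            # [tokens, 1, vocab]
    jsd = js_divergence(layer_probs, final)   # [tokens, layers]
    num_layers = jsd.shape[1]
    layer_ids = np.arange(1, num_layers + 1)
    # 1-indexed position of the last layer still above g (0 if none).
    last_unsettled = np.max(np.where(jsd > g, layer_ids, 0), axis=1)
    return last_unsettled + 1


def deep_thinking_ratio(layer_probs, g, min_depth):
    """Fraction of tokens whose settling depth is at least min_depth."""
    depths = settling_depths(layer_probs, g)
    return float(np.mean(depths >= min_depth))


# Toy input: 2 tokens, 4 layers, vocabulary of size 2 (made-up numbers).
toy = np.array([
    [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.9, 0.1]],  # stable from layer 1
    [[0.1, 0.9], [0.1, 0.9], [0.1, 0.9], [0.9, 0.1]],  # flips at the last layer
])
print(settling_depths(toy, g=0.1))
print(deep_thinking_ratio(toy, g=0.1, min_depth=3))
```

In this toy input the first token never changes its prediction, while the second is revised only at the final layer, so only the second would count as a deep-thinking token under the assumed cutoff.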
Abstract
Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts are unreliable proxies for reasoning quality: increased generation length does not consistently correlate with accuracy and may instead signal "overthinking," leading to performance degradation. In this work, we quantify inference-time effort by identifying deep-thinking tokens -- tokens where internal predictions undergo significant revisions in deeper model layers prior to convergence. Across four challenging mathematical and scientific benchmarks (AIME 24/25, HMMT 25, and GPQA-diamond) and a diverse set of reasoning-focused models (GPT-OSS, DeepSeek-R1, and Qwen3), we show that deep-thinking ratio (the proportion of deep-thinking tokens in a generated sequence) exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines. Leveraging this insight, we introduce Think@n, a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. We demonstrate that Think@n matches or exceeds standard self-consistency performance while significantly reducing inference costs by enabling the early rejection of unpromising generations based on short prefixes.
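The selection step behind Think@n can be sketched as follows. This is a hedged illustration, not the authors' procedure: the `think_at_n` name, the `(answer, prefix_dtr)` sample format, the `keep_fraction` parameter, and the majority vote over the retained samples are assumptions made for the example, and the scores are made-up numbers. In the setting the abstract describes, low-scoring samples would be rejected after a short prefix rather than generated in full; the sketch only shows the ranking and aggregation.

```python
import math
from collections import Counter


def think_at_n(samples, keep_fraction=0.5):
    """Pick an answer by voting over the samples with the highest DTR.

    samples: list of (answer, prefix_dtr) pairs, where prefix_dtr is a
    deep-thinking ratio estimated from a short prefix of the generation.
    keep_fraction: hypothetical share of samples retained for the vote.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    k = max(1, math.ceil(keep_fraction * len(samples)))
    # Rank by DTR, highest first, and keep the top k samples.
    kept = sorted(samples, key=lambda s: s[1], reverse=True)[:k]
    votes = Counter(answer for answer, _ in kept)
    return votes.most_common(1)[0][0]


def self_consistency(samples):
    """Baseline for comparison: plain majority vote over all samples."""
    votes = Counter(answer for answer, _ in samples)
    return votes.most_common(1)[0][0]


# Made-up example: 7 samples for one question.
toy_samples = [
    ("42", 0.30), ("17", 0.10), ("42", 0.25), ("17", 0.12),
    ("17", 0.08), ("42", 0.28), ("17", 0.05),
]
print(think_at_n(toy_samples))        # vote over the 4 highest-DTR samples
print(self_consistency(toy_samples))  # vote over all 7 samples
```

The toy data is constructed so that the two strategies disagree, which shows only how the selection changes the vote; it says nothing about which answer is correct.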
