Statistical Early Stopping for Reasoning Models
Yangxinyu Xie, Tao Wang, Soham Mallick, Yan Sun, Georgy Noarov, Mengxin Yu, Tanwi Mallick, Weijie J. Su, Edgar Dobriban
TL;DR
The paper addresses overthinking in large language models on ill-posed queries by introducing principled early-stopping rules for reasoning traces. It proposes two statistical methods: a parametric renewal-process stopping rule and a nonparametric maxwise conformal stopping rule, both calibrated on well-posed traces and applicable to black-box LLMs. An interpretable lexicon of uncertainty keywords is constructed to serve as the uncertainty signal, and finite-sample guarantees on false positives are provided alongside empirical gains in efficiency and reliability across math and scientific reasoning tasks. The work demonstrates robustness to distribution shifts and offers a practical, interpretable framework for stopping reasoning traces that requires neither access to internal model activations nor expensive retraining.
Abstract
While LLMs have seen substantial improvement in reasoning capabilities, they sometimes overthink, generating unnecessary reasoning steps, particularly under uncertainty, such as when given ill-posed or ambiguous queries. We introduce statistically principled early stopping methods that monitor uncertainty signals during generation to mitigate this issue. Our first approach is parametric: it models inter-arrival times of uncertainty keywords as a renewal process and applies sequential testing for stopping. Our second approach is nonparametric and provides finite-sample guarantees on the probability of halting too early on well-posed queries. We conduct empirical evaluations on reasoning tasks across several domains and models. Our results indicate that uncertainty-aware early stopping can improve both efficiency and reliability in LLM reasoning, and we observe especially significant gains for math reasoning.
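As a rough illustration of the nonparametric idea, the sketch below calibrates a stopping threshold on well-posed traces with a standard split-conformal quantile. It is one plausible reading, not the paper's procedure: the keyword list `KEYWORDS`, the sliding-window score, and the function names `window_scores`, `calibrate_threshold`, and `stopping_step` are all hypothetical. Assuming calibration and test traces are exchangeable, a well-posed trace is halted early with probability at most `alpha`.

```python
import math

# Hypothetical lexicon; the actual keyword list is not given in this summary.
KEYWORDS = ("wait", "hmm", "not sure", "unclear")

def window_scores(steps, window=3):
    """Keyword hits summed over the last `window` reasoning steps."""
    hits = [sum(s.lower().count(k) for k in KEYWORDS) for s in steps]
    return [sum(hits[max(0, i - window + 1): i + 1]) for i in range(len(hits))]

def calibrate_threshold(calibration_traces, alpha):
    """Split-conformal quantile of the per-trace maximum score.

    Each calibration trace (assumed well-posed) is summarized by the
    largest score it ever reaches.
    """
    maxima = sorted(max(window_scores(t), default=0) for t in calibration_traces)
    n = len(maxima)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration traces for this alpha: never stop
    return maxima[rank - 1]

def stopping_step(steps, threshold):
    """Index of the first step whose score exceeds the threshold, else None."""
    for i, score in enumerate(window_scores(steps)):
        if score > threshold:
            return i
    return None

calibration = [
    ["compute 2+2", "answer is 4"],
    ["wait, check", "ok"],
    ["hmm wait", "wait again", "done"],
    ["simple"],
]
threshold = calibrate_threshold(calibration, alpha=0.5)
stop_at = stopping_step(["wait", "hmm, not sure", "wait"], threshold)
```

Summarizing each calibration trace by its maximum score is what lets a single threshold hold over the whole trace rather than at one fixed step. A smaller `alpha` needs more calibration traces before the threshold becomes finite.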
