Statistical Early Stopping for Reasoning Models

Yangxinyu Xie, Tao Wang, Soham Mallick, Yan Sun, Georgy Noarov, Mengxin Yu, Tanwi Mallick, Weijie J. Su, Edgar Dobriban

TL;DR

The paper improves the reliability of reasoning in large language models by curbing overthinking on ill-posed queries with statistically principled early-stopping rules. It introduces two methods: a parametric renewal-process stopping rule and a nonparametric maxwise conformal stopping rule, both calibrated on reasoning traces from well-posed queries and applicable to black-box LLMs. An interpretable lexicon of uncertainty keywords supplies the uncertainty signal. The paper provides finite-sample guarantees on false positives (halting too early on a well-posed query) and reports empirical gains in efficiency and reliability on math and scientific reasoning tasks. It also reports robustness to distribution shifts, and the framework requires neither access to internal model activations nor expensive retraining.
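To make the keyword signal concrete, the following is a minimal sketch of how inter-arrival gaps between uncertainty keywords could be extracted from a reasoning trace. The lexicon `UNCERTAINTY_KEYWORDS`, the regex tokenizer, and both function names are illustrative assumptions, not the paper's actual lexicon, tokenizer, or code.

```python
import re

# Hypothetical lexicon for illustration only; the paper's keyword list is not reproduced here.
UNCERTAINTY_KEYWORDS = {"wait", "hmm", "maybe", "perhaps"}


def keyword_positions(trace, lexicon=UNCERTAINTY_KEYWORDS):
    """Return 0-based token indices at which a lexicon keyword occurs."""
    # Crude word-level tokenizer (assumption); a real system would use the model's tokenizer.
    tokens = re.findall(r"[a-z']+", trace.lower())
    return [i for i, tok in enumerate(tokens) if tok in lexicon]


def inter_arrival_gaps(positions):
    """Gaps, in tokens, between consecutive keyword arrivals."""
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


trace = "Wait, the problem gives no radius. Hmm, maybe it is implied. Perhaps not."
positions = keyword_positions(trace)   # [0, 6, 7, 11]
gaps = inter_arrival_gaps(positions)   # [6, 1, 4]
```

Short gaps mean keywords are arriving in quick succession. A renewal-process rule would compare the observed gaps against a gap distribution fitted on well-posed traces; that fitting and the sequential test are not shown here.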

Abstract

While LLMs have seen substantial improvement in reasoning capabilities, they also sometimes overthink, generating unnecessary reasoning steps, particularly under uncertainty, given ill-posed or ambiguous queries. We introduce statistically principled early stopping methods that monitor uncertainty signals during generation to mitigate this issue. Our first approach is parametric: it models inter-arrival times of uncertainty keywords as a renewal process and applies sequential testing for stopping. Our second approach is nonparametric and provides finite-sample guarantees on the probability of halting too early on well-posed queries. We conduct empirical evaluations on reasoning tasks across several domains and models. Our results indicate that uncertainty-aware early stopping can improve both efficiency and reliability in LLM reasoning, and we observe especially significant gains for math reasoning.
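The nonparametric guarantee can be illustrated with the standard split-conformal construction applied to per-trace maxima of an uncertainty score. This is a sketch under assumptions: the score values, the checkpoint layout, and the function names are hypothetical, and the paper's threshold may be defined differently from the textbook quantile used here.

```python
import math


def maxwise_statistic(scores_at_checkpoints):
    """Largest uncertainty score observed over the checkpoints of one trace."""
    return max(scores_at_checkpoints)


def conformal_threshold(calibration_maxima, alpha):
    """Split-conformal quantile: the k-th smallest maximum, k = ceil((n + 1) * (1 - alpha))."""
    n = len(calibration_maxima)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration traces for this alpha, so never stop early
    return sorted(calibration_maxima)[k - 1]


def first_stop_checkpoint(test_scores, threshold):
    """Index of the first checkpoint whose score exceeds the threshold, or None."""
    for j, score in enumerate(test_scores):
        if score > threshold:
            return j
    return None


# Hypothetical checkpoint scores for seven well-posed calibration traces.
calibration = [[1, 3, 2], [1, 0], [4, 2, 1, 0], [0, 1], [2, 5], [9, 3, 1], [2, 1, 2]]
maxima = [maxwise_statistic(trace) for trace in calibration]  # [3, 1, 4, 1, 5, 9, 2]
threshold = conformal_threshold(maxima, alpha=0.25)           # 5
stop_at = first_stop_checkpoint([2, 4, 6, 3], threshold)      # 2
```

A trace is halted only if some checkpoint score exceeds the threshold, which is the same event as its maximum exceeding the threshold. If the test trace is exchangeable with the calibration traces, the textbook split-conformal argument bounds the probability of that event by `alpha`, which is the kind of finite-sample control over premature halting on well-posed queries that the abstract describes.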

Paper Structure

This paper contains 42 sections, 2 theorems, 2 equations, 12 figures, 15 tables, 2 algorithms.

Key Result

Proposition 2.1

Let $M_i = \max_{j:\, L_j \le |T^{(i)}|} u_i(L_j)$ for the calibration traces, and define $M_{n+1}$ analogously for the test trace. Define $\tau^\star$ as in the maxwise algorithm. Then

Figures (12)

  • Figure 1: An illustration of uncertainty keyword arrival times, where inter-arrival gaps (e.g., 9 and 14 tokens) are highlighted to motivate our renewal-process–based stopping rule.
  • Figure 2: Workflow for extraction, calibration, and testing of the stopping rule.
  • Figure 3: Quantiles of reasoning trace lengths (log scale) for well-posed problems across math benchmarks and two model families. Each panel corresponds to a model, and each curve shows the distribution of trace lengths on a benchmark, summarized by the minimum, 25th percentile, median, 75th percentile, and maximum. Substantial variation across datasets reveals a clear distribution shift in reasoning length, even when all queries are answerable.
  • Figure 4: Comparison of early stopping rates between our proposed methods and the oracle upper bound across math reasoning benchmarks, where different shapes and colors correspond to different datasets and models, respectively. The regression slopes are 0.8238 and 0.8532, respectively, for Renewal and Maxwise.
  • Figure 5: Example of context removal in AbstentionBench's GSM8K subset.
  • ...and 7 more figures

Theorems & Definitions (3)

  • Proposition 2.1
  • Proposition A.1
  • proof