TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning

Alliot Nagle, Jakhongir Saydaliev, Dhia Garbaya, Michael Gastpar, Ashok Vardhan Makkuva, Hyeji Kim

Abstract

Large Reasoning Models (LRMs) achieve impressive performance on complex reasoning tasks via Chain-of-Thought (CoT) reasoning, which enables them to generate intermediate thinking tokens before arriving at the final answer. However, LRMs often suffer from significant overthinking, spending excessive compute time even after the answer is generated early on. Prior work has identified the existence of an optimal reasoning length such that truncating reasoning at this point significantly shortens CoT outputs with virtually no change in performance. However, determining optimal CoT lengths for practical datasets is highly non-trivial, as they are fully task- and model-dependent. In this paper, we address precisely this problem and design TERMINATOR, an early-exit strategy for LRMs at inference to mitigate overthinking. The central idea underpinning TERMINATOR is that the first arrival of an LRM's final answer is often predictable, and we leverage these first answer positions to create a novel dataset of optimal reasoning lengths to train TERMINATOR. Powered by this approach, TERMINATOR achieves significant reductions in CoT lengths of 14%-55% on average across four challenging practical datasets: MATH-500, AIME 2025, HumanEval, and GPQA, whilst outperforming current state-of-the-art methods.

Paper Structure (21 sections, 4 equations, 31 figures, 4 tables, 1 algorithm)

Figures (31)

  • Figure 1: Early stopping via Terminator. Terminator is a binary probe classifier that predicts whether to exit or not at every CoT token. Once the majority of prediction bits within a window ($10$ here) are $1$, </think> is injected into the LRM's token stream to stop thinking (Section \ref{sec:methods}).
  • Figure 2: Event-Locked Averaging of Token-Confidence. Event-locked averaging shows that CoTs consistently agree on spiking behavior at the answer position but disagree elsewhere. This phenomenon is not readily observable in the single-sample case. Figures on the left show the Token-Confidence \cite{fu2025deepthink} and log-probability trajectories throughout reasoning for a single, randomly selected sample; figures on the right show the effect of event-locked averaging on the position of the first arrival of the final answer across all CoTs. The 3200 CoTs used are a random subset of our training set, which combines AIME (1983--2024), MATH, OpenCoder-SFT, and OpenScience. Figures \ref{fig:aligned-timeseries-aime}, \ref{fig:aligned-timeseries-math}, \ref{fig:aligned-timeseries-opencoder}, and \ref{fig:aligned-timeseries-openscience} in Appendix \ref{app:additional_experiments} show similar trends for each dataset separately. Note that the standard error, shown here as a shaded region, is not readily noticeable but becomes more apparent when zooming in.
  • Figure 3: Token Usage Frequency Shift. "Thinking token" usage changes depending on whether the final answer has been generated in the CoT. Rates are computed by counting the raw number of occurrences of the token before and after the answer, and then normalizing each count by the respective number of tokens in the before and after bins. The arrival of the final answer is hinted at by changes in the rates for these tokens. The relative length of a CoT is captured by its dot size, where a longer CoT has a larger dot. Appendix \ref{app:additional_experiments} demonstrates similar results for other "thinking tokens" in Figure \ref{fig:token-scatter-all} and for each data source in Figures \ref{fig:token-scatter-aime}, \ref{fig:token-scatter-math}, \ref{fig:token-scatter-opencoder}, and \ref{fig:token-scatter-openscience}.
  • Figure 4: Training-Dataset Curation Process. We use an LRM to (1) extract the final answer $\hat{\boldsymbol{a}}$ from the final solution $\boldsymbol{s}$, (2) identify the earliest position of $\hat{\boldsymbol{a}}$ in the CoT $\boldsymbol{r}$, and (3) verify that the identified position is correct. If it is, we extract the exact position of $\hat{\boldsymbol{a}}$ from the CoT at the final token-index extraction step; otherwise, we retry the identification step with feedback.
  • Figure 5: OOD Performance of Terminator. The best trade-off between accuracy and compression rate is achieved when the evaluation set is in-distribution with the training dataset. Here the out-of-distribution performance of Terminator with respect to the compression rate (left) and the accuracy (right) for Qwen3-8B is shown. Training datasets are listed along the row axis, and the evaluation sets are listed across the column axis. For example, training Terminator on MATH and evaluating on HumanEval yields a compression rate of $67\%$ and an accuracy of $83\%$. Every training dataset has an in-domain evaluation dataset, i.e. MATH $\rightarrow$ MATH-500, AIME 1983--2024 $\rightarrow$ AIME25, OpenCoder-SFT $\rightarrow$ HumanEval, and OpenScience $\rightarrow$ GPQA.
  • ...and 26 more figures
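
The exit rule described in the Figure 1 caption (stop once the majority of probe predictions within a window are 1) can be illustrated with a small sketch. This is not the authors' implementation: the function name `first_exit_index`, the choice of a strict majority, and the requirement that the window be full before the rule can fire are all assumptions made for illustration.

```python
from collections import deque
from typing import Iterable, Optional


def first_exit_index(bits: Iterable[int], window: int = 10) -> Optional[int]:
    """Return the index of the first token at which a strict majority of the
    last `window` probe predictions equal 1, or None if that never happens.

    Assumptions (not stated in the source): the majority is strict, and the
    rule cannot fire until `window` predictions have been observed.
    """
    recent = deque(maxlen=window)  # sliding window of the latest predictions
    ones = 0                       # running count of 1s inside the window
    for i, bit in enumerate(bits):
        if len(recent) == window:
            ones -= recent[0]      # this element is evicted by the append below
        recent.append(bit)
        ones += bit
        if len(recent) == window and 2 * ones > window:
            return i               # here </think> would be injected
    return None


# Ten "continue" predictions followed by six "exit" predictions.
print(first_exit_index([0] * 10 + [1] * 6))
```

With a window of 10, the rule in this sketch fires only once at least six of the last ten predictions are 1, so a few isolated exit predictions do not end the reasoning.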
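
Event-locked averaging, as used in Figure 2, aligns every trajectory at its own event position (here, the first arrival of the final answer) and averages across trajectories at each relative offset. The sketch below shows the general technique under stated assumptions: the function name `event_locked_average`, the `radius` parameter, and the handling of offsets that fall outside a trajectory (they are skipped) are hypothetical choices, not details taken from the paper.

```python
from typing import Dict, Sequence


def event_locked_average(
    series: Sequence[Sequence[float]],
    events: Sequence[int],
    radius: int,
) -> Dict[int, float]:
    """Average several trajectories after aligning each one at its event index.

    Returns a mapping from relative offset (0 is the event itself) to the mean
    value at that offset. Offsets outside a trajectory are skipped for that
    trajectory, which is an assumption of this sketch.
    """
    offsets = range(-radius, radius + 1)
    sums = {k: 0.0 for k in offsets}
    counts = {k: 0 for k in offsets}
    for values, event in zip(series, events):
        for k in offsets:
            idx = event + k
            if 0 <= idx < len(values):
                sums[k] += values[idx]
                counts[k] += 1
    return {k: sums[k] / counts[k] for k in offsets if counts[k] > 0}


# Two toy trajectories whose spikes sit at different absolute positions.
toy = [[0.0, 0.0, 5.0, 0.0], [0.0, 5.0, 0.0, 0.0, 0.0]]
print(event_locked_average(toy, events=[2, 1], radius=1))
```

Because the spikes line up only after alignment, the averaged trace keeps the peak at offset 0, while behavior that differs between trajectories tends to cancel out.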
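
The rate computation described in the Figure 3 caption (count a token's occurrences before and after the answer, then normalize each count by the size of its bin) can be sketched as follows. The function name `usage_rates` is hypothetical, and placing the answer token itself in the "after" bin is an assumption; the caption does not specify which bin it belongs to.

```python
from typing import List, Tuple


def usage_rates(tokens: List[str], answer_index: int, target: str) -> Tuple[float, float]:
    """Return (rate_before, rate_after) for `target` in a tokenized CoT.

    Each rate is the raw count of `target` in a bin divided by the number of
    tokens in that bin. The token at `answer_index` is assigned to the "after"
    bin, which is an assumption of this sketch. An empty bin yields 0.0.
    """
    before = tokens[:answer_index]
    after = tokens[answer_index:]
    rate_before = before.count(target) / len(before) if before else 0.0
    rate_after = after.count(target) / len(after) if after else 0.0
    return rate_before, rate_after


cot = ["wait", "so", "x", "=", "4", "wait", "wait", "hmm"]
print(usage_rates(cot, answer_index=4, target="wait"))
```

Normalizing by bin size keeps the two rates comparable even when the answer arrives very early or very late in the CoT.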