One-Token Verification for Reasoning Correctness Estimation

Zhan Zhuang, Xiequn Wang, Zebin Chen, Feiyang Ye, Ying Wei, Kede Ma, Yu Zhang

TL;DR

One-Token Verification (OTV) is a computational method that estimates reasoning correctness in a single forward pass during generation. It reduces token usage by up to $90\%$ through correctness-guided early termination, prioritizing shorter, more reliable solutions.

Abstract

Recent breakthroughs in large language models (LLMs) have led to notable successes in complex reasoning tasks, such as mathematical problem solving. A common strategy for improving performance is parallel thinking, in which multiple reasoning traces are generated and the final prediction is made using aggregation schemes like majority voting or best-of-$N$ decoding. However, two key challenges persist. First, multi-sample decoding incurs substantial inference latency, especially for long-form outputs. Second, effective mechanisms for reliably assessing the correctness of individual reasoning traces are still limited. To address these challenges, we introduce One-Token Verification (OTV), a computational method that estimates reasoning correctness in a single forward pass during generation. OTV is activated by a learnable token and integrated into the LLM via low-rank adaptation to probe internal reasoning signals through the key-value cache, supporting token-level correctness estimation at any stage of generation without disrupting primary reasoning. Experiments on mathematical reasoning benchmarks demonstrate that OTV consistently surpasses existing verifiers. Additionally, OTV reduces token usage by up to $90\%$ through correctness-guided early termination, prioritizing shorter, more reliable solutions.
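To make the aggregation schemes mentioned in the abstract concrete, the following is a minimal Python sketch of confidence-weighted majority voting, best-of-$N$ selection, and a simple early-termination check. It is not the authors' implementation: the function names, the `threshold` and `patience` parameters, and the stopping rule itself are hypothetical choices made for illustration, and the confidence scores are assumed to come from some verifier such as OTV.

```python
from collections import defaultdict
from typing import Dict, Sequence


def weighted_majority_vote(answers: Sequence[str], confidences: Sequence[float]) -> str:
    """Return the answer whose traces carry the largest total confidence."""
    if not answers or len(answers) != len(confidences):
        raise ValueError("answers and confidences must be non-empty and equally long")
    totals: Dict[str, float] = defaultdict(float)
    for answer, confidence in zip(answers, confidences):
        totals[answer] += confidence
    # Ties are broken in favour of the answer that appeared first.
    return max(totals, key=lambda answer: totals[answer])


def best_of_n(answers: Sequence[str], confidences: Sequence[float]) -> str:
    """Return the answer of the single most confident trace."""
    if not answers or len(answers) != len(confidences):
        raise ValueError("answers and confidences must be non-empty and equally long")
    best_index = max(range(len(answers)), key=lambda i: confidences[i])
    return answers[best_index]


def should_stop_early(confidence_history: Sequence[float],
                      threshold: float = 0.2,
                      patience: int = 3) -> bool:
    """Hypothetical stopping rule (an assumption, not taken from the paper):
    terminate a trace once its last `patience` confidence estimates are all
    below `threshold`."""
    if len(confidence_history) < patience:
        return False
    return all(score < threshold for score in confidence_history[-patience:])


if __name__ == "__main__":
    # Five sampled traces: three agree on "17" but with low confidence.
    answers = ["42", "42", "17", "17", "17"]
    confidences = [0.9, 0.8, 0.2, 0.3, 0.1]
    print(weighted_majority_vote(answers, confidences))
    print(best_of_n(answers, confidences))
    print(should_stop_early([0.6, 0.15, 0.1, 0.05]))
```

In this toy example, unweighted majority voting would pick "17" (three votes against two), whereas weighting by confidence favours "42". Terminating low-confidence traces early is what saves tokens; how aggressively to do so is a design choice this sketch leaves to the two hypothetical parameters.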

Paper Structure (36 sections, 2 theorems, 19 equations, 5 figures, 6 tables, 2 algorithms)

Key Result

Proposition 3.1

For any fixed $t$ and any state $\bm s_{t}$, among all measurable functions $f_{\bm \phi}(\cdot)$, the minimizer of the conditional risk $\mathbb{E}\!\left[(f_{\bm \phi}(\bm s_{t})-c(t,T,y))^2 \mid \bm s_{t}\right]$ is the conditional expectation $\mathbb{E}\!\left[c(t,T,y) \mid \bm s_{t}\right]$.
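
The standard argument for why the conditional expectation minimizes a squared-error risk can be sketched as follows. This is the textbook decomposition, not necessarily the proof given in the paper. The shorthand $c = c(t,T,y)$ and $m(\bm s_{t}) = \mathbb{E}[c \mid \bm s_{t}]$ is introduced here for illustration only, and $c$ is assumed to have a finite second moment. For any measurable $f$:

$$
\begin{aligned}
\mathbb{E}\!\left[(f(\bm s_{t})-c)^2 \mid \bm s_{t}\right]
&= \mathbb{E}\!\left[\big(f(\bm s_{t})-m(\bm s_{t})+m(\bm s_{t})-c\big)^2 \mid \bm s_{t}\right] \\
&= \big(f(\bm s_{t})-m(\bm s_{t})\big)^2
 + 2\,\big(f(\bm s_{t})-m(\bm s_{t})\big)\,\mathbb{E}\!\left[m(\bm s_{t})-c \mid \bm s_{t}\right]
 + \mathbb{E}\!\left[(m(\bm s_{t})-c)^2 \mid \bm s_{t}\right] \\
&= \big(f(\bm s_{t})-m(\bm s_{t})\big)^2 + \operatorname{Var}(c \mid \bm s_{t}).
\end{aligned}
$$

The cross term vanishes because $\mathbb{E}[m(\bm s_{t})-c \mid \bm s_{t}] = 0$ by the definition of $m$. The variance term does not depend on $f$, so the risk is minimized exactly when $f(\bm s_{t}) = m(\bm s_{t})$.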

Figures (5)

  • Figure 1: Conceptual illustration of the proposed OTV. By reusing the KV cache and activating a LoRA-based verifier via a special token [ToT], OTV reliably estimates the correctness of reasoning traces in a single forward pass.
  • Figure 2: Confidence dynamics on three representative AIME24 problems (i.e., #3, #9, and #22). For each predictor, we plot the mean confidence trajectory over $32$ sampled reasoning traces, shown separately for traces that end with correct (red) and incorrect (green) final answers. Shaded bands around each mean curve denote the inter-quantile range across traces, summarizing cross-trace variability.
  • Figure 3: Effect of verifier capacity (i.e., LoRA rank) on training dynamics and downstream voting accuracy. Left: verifier training loss over optimization steps for the "probe" baseline, which trains only the regression head (no LoRA; no KV cache), and for OTV with varying LoRA ranks. Middle/Right: weighted majority-voting accuracy on AIME as a function of the number of sampled traces. All results are averaged over $64$ runs.
  • Figure 4: Evaluation on GSM8K using Qwen3-4B-Base.
  • Figure 5: Trace-level confidence trajectories across problems in (a) AIME24 and (b) AIME25.

Theorems & Definitions (4)

  • Proposition 3.1: Risk Minimizer under MSE
  • proof
  • Proposition 3.2
  • proof