Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification

Anqi Zhang, Yulin Chen, Jane Pan, Chen Zhao, Aurojit Panda, Jinyang Li, He He

TL;DR

This work demonstrates that reasoning models encode information about the correctness of intermediate answers in their hidden states. By training a lightweight binary probe on chunk representations of long chain-of-thought, the authors achieve well-calibrated predictions of intermediate-answer correctness and reveal lookahead signals that precede explicit answers. They validate the probe's usefulness as a verifier for an early-exit strategy, reducing inference tokens by up to 24% without sacrificing accuracy. The findings highlight untapped self-verification capabilities in reasoning models and point to practical paths to improve efficiency through internal, on-policy decision-making.

Abstract

Reasoning models have achieved remarkable performance on tasks like math and logical reasoning thanks to their ability to search during reasoning. However, they still suffer from overthinking, often performing unnecessary reasoning steps even after reaching the correct answer. This raises the question: can models evaluate the correctness of their intermediate answers during reasoning? In this work, we study whether reasoning models encode information about answer correctness through probing the model's hidden states. The resulting probe can verify intermediate answers with high accuracy and produces highly calibrated scores. Additionally, we find models' hidden states encode correctness of future answers, enabling early prediction of the correctness before the intermediate answer is fully formulated. We then use the probe as a verifier to decide whether to exit reasoning at intermediate answers during inference, reducing the number of inference tokens by 24% without compromising performance. These findings confirm that reasoning models do encode a notion of correctness yet fail to exploit it, revealing substantial untapped potential to enhance their efficiency.
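
To make the probing idea concrete, here is a minimal sketch of a binary correctness probe. It is not the authors' implementation: the abstract does not specify the probe architecture, so plain logistic regression is assumed here, and the two-dimensional toy vectors stand in for real chunk hidden states. The names `train_probe` and `probe_score` are hypothetical.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(
    features: List[List[float]],
    labels: List[int],
    lr: float = 0.5,
    epochs: int = 500,
) -> Tuple[List[float], float]:
    """Fit a logistic-regression probe with full-batch gradient descent.

    features: one vector per reasoning chunk (assumed stand-in for a hidden state).
    labels:   1 if the chunk's intermediate answer was correct, else 0.
    """
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_score(w: List[float], b: float, x: List[float]) -> float:
    """Predicted probability that the chunk's intermediate answer is correct."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Toy data (assumption): the first feature separates correct from incorrect chunks.
toy_features = [[1.0, 0.2], [0.9, -0.1], [-1.0, 0.1], [-0.8, -0.3]]
toy_labels = [1, 1, 0, 0]
w, b = train_probe(toy_features, toy_labels)
```

In practice the feature vectors would be extracted from the reasoning model at each chunk, and the probe's scores would be checked for calibration on a held-out set before being used as a verifier.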

Paper Structure

This paper contains 26 sections, 1 equation, 6 figures, 15 tables.

Figures (6)

  • Figure 1: An illustration of the probing method. On the left, a long CoT is parsed into multiple chunks, each corresponding to a reasoning path that terminates in an intermediate answer. On the right, a representation is obtained for each chunk and a probe predicts the probability that the answer is correct.
  • Figure 2: ROC-AUC scores for each probe trained on hidden states from different reasoning models and datasets. We train a separate probe on each probing dataset and evaluate it on the in-distribution test set.
  • Figure 3: Comparison of probe performance for reasoning models (i.e., R1-Distill-Llama-8B, fine-tuned from the base Llama-3.1-8B model using long CoT data) and non-reasoning models (i.e., Llama-3.1-8B-Instruct) on MATH. For reasoning models, we show performance on predicting the correctness of both intermediate answers (blue) and final answers (green). For non-reasoning models, the data contains only final answers (red).
  • Figure 4: Performance on predicting the correctness of the upcoming intermediate answer midway through a reasoning chunk. The results are obtained at different percentages of all paragraphs within each chunk. The task dataset is MATH and the reasoning model is R1-Distill-Llama-8B.
  • Figure 5: Final answer accuracy versus inference token cost with different early-exit strategies. For confidence-based early-exit, the curve is obtained by varying the confidence threshold for answer correctness. For static early-exit, the curve is generated by varying the chunk number $m$.
  • ...and 1 more figures
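
The two early-exit strategies named in the Figure 5 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names, the example answers, and the probe scores are hypothetical, and the fallback to the last available answer when no chunk clears the threshold is an assumed policy.

```python
from typing import List, Tuple


def confidence_early_exit(
    answers: List[str], scores: List[float], threshold: float
) -> Tuple[str, int]:
    """Exit at the first chunk whose probe score reaches the threshold.

    Returns (answer, number_of_chunks_consumed). If no score reaches the
    threshold, falls back to the last answer (assumed policy).
    """
    if not answers or len(answers) != len(scores):
        raise ValueError("answers and scores must be non-empty and equal length")
    for i, (answer, score) in enumerate(zip(answers, scores)):
        if score >= threshold:
            return answer, i + 1
    return answers[-1], len(answers)


def static_early_exit(answers: List[str], m: int) -> Tuple[str, int]:
    """Always exit after chunk m (or at the last chunk if fewer exist)."""
    if not answers or m < 1:
        raise ValueError("need at least one answer and m >= 1")
    k = min(m, len(answers))
    return answers[k - 1], k


# Hypothetical intermediate answers and probe scores for one problem.
answers = ["12", "15", "15", "15"]
scores = [0.20, 0.90, 0.95, 0.97]

conf_result = confidence_early_exit(answers, scores, threshold=0.85)
static_result = static_early_exit(answers, m=1)
```

With these toy inputs, the confidence-based rule stops at the second chunk with answer "15", while the static rule with m = 1 stops at the first chunk with answer "12". Sweeping `threshold` or `m` traces out an accuracy-versus-token-cost curve of the kind the caption describes.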