ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification

Hyunseok Lee, Seunghyuk Oh, Jaehyung Kim, Jinwoo Shin, Jihoon Tack

TL;DR

ReVISE introduces intrinsic self-verification for LLMs, enabling the model to verify its own reasoning and decide whether to refine using a dedicated [refine] token. The method employs a two-stage curriculum with preference learning (via SFT and DPO) to first learn verification and then corrective refinement, avoiding external verifiers or RL. At inference, a confidence-aware sampling strategy leverages the model's verification signal to improve test-time accuracy with scalable computation. Empirically, ReVISE improves reasoning performance on GSM8K, MATH-500, and MBPP, demonstrates test-time scalability, and shows robust cross-domain generalization, highlighting practical gains for complex reasoning tasks and safety-conscious applications.
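The confidence-aware sampling idea above can be illustrated with a small sketch. It assumes each sampled answer comes with a scalar confidence (here taken to be the model's probability of stopping rather than refining) and that answers are combined by confidence-weighted voting. The function names, the input format, and the weighting rule are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict


def confidence_weighted_vote(samples):
    """Return the answer with the largest total confidence.

    `samples` is a list of (answer, confidence) pairs. The confidence is
    ASSUMED to be the model's verification signal for that sample, e.g.
    its probability of emitting [eos] instead of [refine].
    """
    if not samples:
        raise ValueError("need at least one sample")
    totals = defaultdict(float)
    for answer, confidence in samples:
        totals[answer] += confidence
    # Ties resolve to the answer that was seen first.
    return max(totals, key=totals.get)


def simple_majority_vote(samples):
    """Baseline for comparison: every sample counts equally."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = defaultdict(int)
    for answer, _ in samples:
        counts[answer] += 1
    return max(counts, key=counts.get)
```

With the hypothetical samples `[("42", 0.9), ("41", 0.3), ("41", 0.4)]`, plain majority voting picks `"41"`, while the weighted vote picks `"42"` because one high-confidence sample outweighs two low-confidence ones.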

Abstract

Self-awareness, i.e., the ability to assess and correct one's own generation, is a fundamental aspect of human intelligence, making its replication in large language models (LLMs) an important yet challenging task. Previous works tackle this by employing extensive reinforcement learning or by relying on large external verifiers. In this work, we propose Refine via Intrinsic Self-Verification (ReVISE), an efficient and effective framework that enables LLMs to self-correct their outputs through self-verification. The core idea of ReVISE is to enable LLMs to verify their reasoning processes and continually rethink reasoning trajectories based on their verification. We introduce a structured curriculum based upon online preference learning to implement this efficiently. Specifically, as ReVISE involves two challenging tasks (i.e., self-verification and reasoning correction), we tackle each task sequentially using curriculum learning, collecting both failed and successful reasoning paths to construct preference pairs for efficient training. During inference, our approach enjoys natural test-time scaling by integrating self-verification and correction capabilities, further enhanced by our proposed confidence-aware decoding mechanism. Our experiments on various reasoning tasks demonstrate that ReVISE achieves efficient self-correction and significantly improves reasoning performance.
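The construction of preference pairs from failed and successful reasoning paths might look roughly like the sketch below. It is one reading of the two-stage curriculum, under stated assumptions: in stage 1 the preferred continuation of a sampled response is `[eos]` when the response is correct and `[refine]` otherwise; in stage 2 an incorrect response is paired with `[refine]` followed by a reference solution. The helper names, the `is_correct` callback, and the prompt/chosen/rejected record layout are hypothetical and not taken from the paper.

```python
EOS, REFINE = "[eos]", "[refine]"


def verification_pairs(prompt, responses, is_correct):
    """Stage 1 (assumed form): prefer [eos] after correct responses
    and [refine] after incorrect ones."""
    pairs = []
    for response in responses:
        if is_correct(response):
            chosen, rejected = EOS, REFINE
        else:
            chosen, rejected = REFINE, EOS
        pairs.append(
            {"prompt": prompt + response, "chosen": chosen, "rejected": rejected}
        )
    return pairs


def refinement_pairs(prompt, responses, is_correct, gold_solution):
    """Stage 2 (assumed form): after an incorrect response, prefer
    [refine] followed by the reference solution over stopping."""
    pairs = []
    for response in responses:
        if is_correct(response):
            continue  # only failed paths need a correction target
        pairs.append(
            {
                "prompt": prompt + response,
                "chosen": REFINE + gold_solution,
                "rejected": EOS,
            }
        )
    return pairs
```

Records in this shape could then be passed to a preference-learning objective such as DPO, which the TL;DR names as part of the training recipe.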

Paper Structure

This paper contains 21 sections, 7 equations, 8 figures, 18 tables.

Figures (8)

  • Figure 1: Overview of ReVISE. Left: ReVISE is a self-verifying and self-correcting reasoning framework. It first generates an initial answer, verifies its correctness, and decides whether to stop or refine. If the model generates the $[\mathtt{refine}]$ token, it refines the initial reasoning. Right: The structured curriculum-based training pipeline of ReVISE. In the first stage, the model learns self-verification by selecting between $[\mathtt{eos}]$ and $[\mathtt{refine}]$. In the second stage, it learns to correct reasoning mistakes using golden data.
  • Figure 2: Test-time scaling comparison between ReVISE (Ours) and baselines, including SFT, RFT, STAR$^+$, and majority voting for ReVISE (Ours (Simple Maj.)) at sampling sizes $N\in\{1,2,3,4,8\}$. (a) Results for Llama-3.2-1B on the GSM8K dataset. (b) Results for Llama-3.1-8B on the MATH dataset. ReVISE consistently outperforms baselines across all sample sizes and datasets.
  • Figure 3: Ablation study on curriculum learning in terms of (a) final accuracy (%) and (b) self-verification accuracy, reported as AUROC (%). The experiments are conducted using Llama-3.2-1B on the GSM8K dataset. The comparison includes a model trained without curriculum learning (w/o Cur.), a model trained only through stage 1 (Stage 1), and a model trained with the full two-stage curriculum (ReVISE) (Stage 2). (a) Accuracy improves with curriculum learning by mitigating conflicts between competing objectives during early training stages. (b) AUROC results demonstrate improved discrimination between correct and incorrect responses and effective transfer from Stage 1 to the final ReVISE model.
  • Figure 4: Ablation study on DPO loss, evaluated on the GSM8K benchmark. Removing DPO loss significantly reduces accuracy.
  • Figure 5: Distribution histogram of $\mathcal{M}([\mathtt{eos}]) - \mathcal{M}([\mathtt{refine}])$ (the context $x$ is omitted for simplicity). The value $\mathcal{M}([\mathtt{eos}]) - \mathcal{M}([\mathtt{refine}])=0$ is the threshold at which ReVISE intrinsically decides whether or not to refine. Experiments are conducted using the Llama-3.2-1B model.
  • ...and 3 more figures
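
The verify-then-refine loop from Figure 1 and the zero threshold from Figure 5 can be sketched together as follows. This is a minimal illustration under stated assumptions: `generate` and `score_tokens` are hypothetical callbacks standing in for the model $\mathcal{M}$, `max_refinements` is an added safety cap, and treating a score difference of exactly 0 as "stop" is an arbitrary tie-breaking choice.

```python
REFINE = "[refine]"


def should_refine(score_eos, score_refine):
    """Refine when M([eos]) - M([refine]) falls below the threshold 0."""
    return score_eos - score_refine < 0.0


def revise_generate(prompt, generate, score_tokens, max_refinements=1):
    """Generate an answer, self-verify, and refine while the model asks to.

    generate(context) -> str               : hypothetical text sampler
    score_tokens(context) -> (float, float): hypothetical scores the model
        assigns to [eos] and [refine] as the next token after `context`
    """
    context = prompt
    attempt = generate(context)
    for _ in range(max_refinements):
        context += attempt
        score_eos, score_refine = score_tokens(context)
        if not should_refine(score_eos, score_refine):
            break  # the model prefers [eos]: keep the current attempt
        context += REFINE
        attempt = generate(context)  # new reasoning conditioned on the old
    return attempt
```

Shifting the threshold away from 0 would trade extra refinement compute against the risk of rewriting answers that were already correct; the sketch keeps it at 0 to match the caption of Figure 5.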