Incentivizing LLMs to Self-Verify Their Answers

Fuxiang Zhang, Jiacheng Xu, Chaojie Wang, Ce Cui, Yang Liu, Bo An

TL;DR

This work tackles the bottleneck in jointly leveraging post-training RL improvements and test-time scaling by introducing a self-verification framework that trains LLMs to both solve and verify their own answers within a single RL loop. It unifies answer generation and verification through GRPO-based optimization, augmented with a policy-aligned online buffer and a dynamic verification reward, enabling robust test-time scaling without external verifiers. Empirically, the approach yields strong post-training gains on math benchmarks and enables effective inference-time scaling, with verification-driven aggregation often outperforming external RM-based baselines. The results suggest that self-verification reduces distribution-shift issues between post-trained generators and verifiers, offering a practical path toward more reliable and scalable reasoning systems. Limitations include domain specificity to mathematics and potential opportunities to extend to longer multi-turn interactions or other reasoning domains.
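As a rough illustration of the GRPO-based optimization mentioned above, the sketch below computes group-relative advantages: each sampled response's reward is normalized by the mean and standard deviation of its own group. The function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for this example. The summary does not specify the paper's reward definitions, online buffer, or dynamic verification reward, so none of those are modeled here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar reward per response sampled for the same
    prompt. The result is (r - group_mean) / (group_std + eps), the
    group-relative advantage used by GRPO-style methods.
    `eps` is a hypothetical stabilizer for groups with identical rewards.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std: an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical binary correctness rewards for four sampled responses.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Under this normalization, a group whose responses all receive the same reward yields zero advantage for every response, so such a group contributes no learning signal.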

Abstract

Large Language Models (LLMs) have demonstrated remarkable progress in complex reasoning tasks through both post-training and test-time scaling laws. While prevalent test-time scaling approaches are often realized by using external reward models to guide the model generation process, we find that only marginal gains can be acquired when scaling a model post-trained on specific reasoning tasks. We identify that the limited improvement stems from distribution discrepancies between the specific post-trained generator and the general reward model. To address this, we propose a framework that incentivizes LLMs to self-verify their own answers. By unifying answer generation and verification within a single reinforcement learning (RL) process, we train models that can effectively assess the correctness of their own solutions. The trained model can further scale its performance at inference time by verifying its generations, without the need for external verifiers. We train our self-verification models based on Qwen2.5-Math-7B and DeepSeek-R1-Distill-Qwen-1.5B, demonstrating their capabilities across varying reasoning context lengths. Experiments on multiple mathematical reasoning benchmarks show that our models can not only improve post-training performance but also enable effective test-time scaling.
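To make the idea of scaling at inference time through self-verification concrete, the sketch below aggregates several sampled answers by summing a verification score per distinct answer and returning the answer with the highest total. This is only one plausible aggregation rule, not necessarily the one used in the paper. The function name `verification_weighted_vote` and the hard-coded scores are hypothetical; in practice the scores would come from the model judging its own solutions.

```python
from collections import defaultdict


def verification_weighted_vote(answers, verify_scores):
    """Pick the final answer by verification-weighted voting.

    `answers` are the final answers extracted from N sampled solutions.
    `verify_scores` are scores in [0, 1], assumed to be produced by the
    same model when asked to verify each solution (hypothetical input).
    """
    if not answers:
        raise ValueError("answers must not be empty")
    if len(answers) != len(verify_scores):
        raise ValueError("answers and verify_scores must have equal length")
    totals = defaultdict(float)
    for answer, score in zip(answers, verify_scores):
        totals[answer] += score  # accumulate support per distinct answer
    return max(totals, key=totals.get)


# Hypothetical samples: plain majority voting would choose "b" here.
final = verification_weighted_vote(["a", "b", "b"], [1.0, 0.3, 0.3])
```

In the example, the single answer the verifier rates highly outweighs two low-scored agreeing answers, which is the behavior that distinguishes this scheme from unweighted majority voting.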

Paper Structure

This paper contains 41 sections, 6 equations, 8 figures, 9 tables, 1 algorithm.

Figures (8)

  • Figure 1: Average performance of post-trained models (dotted lines) and test-time scaling methods (solid lines) on the MATH500 and AIME24 benchmarks. Our self-verification framework not only enhances post-training performance with RL on both problem-solving and verification, but also enables effective test-time scaling with increased generation numbers by verifying its own solutions.
  • Figure 2: Overview of our self-verification framework. The model is trained to solve mathematical reasoning problems and verify generated solutions simultaneously.
  • Figure 3: Token usage comparison between problem-solving and verification tasks.
  • Figure 4: Average time cost of different test-time scaling methods per problem from MATH500.
  • Figure 5: Test-time scaling performance of Self-Verification-Qwen-7B on math reasoning benchmarks including AIME24, AIME25, AMC23, and OlympiadBench.
  • ...and 3 more figures