Incentivizing LLMs to Self-Verify Their Answers
Fuxiang Zhang, Jiacheng Xu, Chaojie Wang, Ce Cui, Yang Liu, Bo An
TL;DR
This work addresses a bottleneck in combining post-training RL with test-time scaling: a generator post-trained on a specific reasoning task gains little from an external, general-purpose reward model. It introduces a self-verification framework that trains an LLM to both solve problems and verify its own answers within a single RL loop. Answer generation and verification are unified through GRPO-based optimization, augmented with a policy-aligned online buffer and a dynamic verification reward, which enables test-time scaling without external verifiers. Empirically, the approach yields post-training gains on math benchmarks and effective inference-time scaling, with verification-driven aggregation often outperforming external RM-based baselines. The results suggest that self-verification reduces the distribution shift between post-trained generators and verifiers, offering a practical path toward more reliable and scalable reasoning systems. The main limitation is that the evaluation is restricted to mathematics; extending the approach to longer multi-turn interactions or other reasoning domains remains open.
Abstract
Large Language Models (LLMs) have demonstrated remarkable progress in complex reasoning tasks through both post-training and test-time scaling. While prevalent test-time scaling approaches often rely on external reward models to guide the generation process, we find that they yield only marginal gains when applied to a model post-trained on specific reasoning tasks. We identify that the limited improvement stems from distribution discrepancies between the specific post-trained generator and the general reward model. To address this, we propose a framework that incentivizes LLMs to self-verify their answers. By unifying answer generation and verification within a single reinforcement learning (RL) process, we train models that can effectively assess the correctness of their own solutions. The trained model can further scale its performance at inference time by verifying its generations, without the need for external verifiers. We train our self-verification models based on Qwen2.5-Math-7B and DeepSeek-R1-Distill-Qwen-1.5B, demonstrating their capabilities across varying reasoning context lengths. Experiments on multiple mathematical reasoning benchmarks show that our models not only improve post-training performance but also enable effective test-time scaling.
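To make inference-time scaling by self-verification concrete, the following is a minimal sketch of one plausible aggregation rule: sample several answers, have the same model score each one, and pick the answer with the largest total score. The abstract does not specify the aggregation rule, so the score-weighted vote, the function names `weighted_vote` and `majority_vote`, and the `(answer, score)` interface are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def majority_vote(candidates):
    """Baseline: return the most frequent answer, ignoring verification scores."""
    if not candidates:
        raise ValueError("need at least one candidate")
    counts = Counter(answer for answer, _ in candidates)
    return counts.most_common(1)[0][0]


def weighted_vote(candidates):
    """Return the answer whose samples have the largest total verification score.

    candidates: list of (answer, score) pairs, where score in [0, 1] is the
    model's own estimate that the answer is correct (hypothetical interface,
    not taken from the paper).
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    # Ties are broken in favor of the answer that was sampled first.
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Three sampled answers with illustrative (made-up) self-verification scores.
    samples = [("12", 0.2), ("12", 0.3), ("15", 0.9)]
    print(majority_vote(samples))  # most frequent answer
    print(weighted_vote(samples))  # answer with the highest total score
```

In this toy input the two rules disagree: the plain majority picks the answer that appears twice, while the weighted vote prefers the single answer that the verifier rates highly. That is the situation in which a verifier aligned with the generator's own output distribution would be expected to help.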
