Learning to Self-Verify Makes Language Models Better Reasoners

Yuxin Chen; Yu Wang; Yi Zhang; Ziang Ye; Zhengzhou Cai; Yaorui Shi; Qi Gu; Hui Su; Xunliang Cai; Xiang Wang; An Zhang; Tat-Seng Chua

TL;DR

The paper tackles the persistent asymmetry where language models excel at generating reasoning traces but struggle to verify their own outputs. It demonstrates that improving self-verification can, in fact, enhance generation performance, and introduces a self-verification training pipeline within a reinforcement-learning-with-verifiable-rewards framework. Building on this, it presents a decoupled multi-task RL approach that jointly leverages generation and verification through stage-wise initialization or alternating training, achieving gains across multiple math benchmarks and model sizes. The findings suggest verification signals are a powerful, independent training objective that can improve efficiency (fewer tokens) and enable test-time scaling, with broad implications for more reliable, scalable reasoning systems.

Abstract

Recent large language models (LLMs) achieve strong performance in generating promising reasoning paths for complex tasks. However, despite powerful generation ability, LLMs remain weak at verifying their own answers, revealing a persistent capability asymmetry between generation and self-verification. In this work, we conduct an in-depth investigation of this asymmetry throughout training evolution and show that, even on the same task, improving generation does not lead to corresponding improvements in self-verification. Interestingly, we find that the reverse direction of this asymmetry behaves differently: learning to self-verify can effectively improve generation performance, achieving accuracy comparable to standard generation training while yielding more efficient and effective reasoning traces. Building on this observation, we further explore integrating self-verification into generation training by formulating a multi-task reinforcement learning framework, where generation and self-verification are optimized as two independent but complementary objectives. Extensive experiments across benchmarks and models demonstrate performance gains over generation-only training in both generation and verification capabilities.

Paper Structure (34 sections, 6 equations, 4 figures, 4 tables)

This paper contains 34 sections, 6 equations, 4 figures, 4 tables.

Introduction
Preliminary
RLVR
Generation Training
Verification Training
Learning to Self-Verify
Self-Verification Framework
On-Policy Sample Collection
Post-Processing
Training
Experimental Setup
Dataset and Benchmarks
Implementation
Evaluation
Main Results
...and 19 more sections

Figures (4)

Figure 1: Training dynamics of Qwen2.5-1.5B-Instruct. (Top) Generation and self-verification show a persistent asymmetry: learning to generate does not improve self-verification ability, even on the same task. (Bottom) In the reverse direction, learning to self-verify improves both self-verification ability and generation performance.
Figure 2: Overview of our self-verification training framework. We collect on-policy problem-solving trajectories from the model and obtain correctness labels from a verifier. These trajectories are then processed through a post-processing pipeline, including data balancing, filtering, and diversity-aware sampling, to construct self-verification training data, which is used to train the model to judge the correctness of its own answers. We find that training the model solely for self-verification already leads to improved generation performance. Integrating this self-verification objective into generation training further strengthens the model's generation ability.
Figure 3: Comparison of accuracy and token usage between generation training and self-verification training on AIME24 with Qwen2.5-1.5B-Instruct.
Figure 4: Performance comparison under the partially corrupted reasoning-prefix setting.
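
The post-processing stage named in the Figure 2 caption (data balancing, filtering, diversity-aware sampling) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `build_verification_set`, the trajectory fields (`problem`, `answer`, `correct`), and the concrete rules (drop trajectories with no extracted answer, keep one trajectory per distinct answer per problem, downsample to equal numbers of correct and incorrect examples) are all assumptions chosen only to make the three steps concrete.

```python
import random
from collections import defaultdict


def build_verification_set(trajectories, seed=0):
    """Turn labeled on-policy trajectories into a balanced verification set.

    Hypothetical input format: each trajectory is a dict with keys
    'problem' (str), 'answer' (str or None), and 'correct' (bool, the
    label supplied by an external verifier).
    """
    rng = random.Random(seed)

    # Filtering (assumed rule): drop trajectories with no extracted answer.
    usable = [t for t in trajectories if t["answer"] is not None]

    # Diversity (assumed rule): keep one trajectory per distinct answer
    # within each problem, so repeated identical answers do not dominate.
    by_problem = defaultdict(dict)
    for t in usable:
        by_problem[t["problem"]].setdefault(t["answer"], t)
    diverse = [t for answers in by_problem.values() for t in answers.values()]

    # Balancing (assumed rule): downsample the majority label so that
    # correct and incorrect examples appear in equal numbers.
    positives = [t for t in diverse if t["correct"]]
    negatives = [t for t in diverse if not t["correct"]]
    n = min(len(positives), len(negatives))
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)
    return balanced
```

Balancing matters in a setup like this because a model that is usually right (or usually wrong) on the training problems could otherwise earn a high verification reward by always predicting the majority label.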
