Learning to Self-Verify Makes Language Models Better Reasoners
Yuxin Chen, Yu Wang, Yi Zhang, Ziang Ye, Zhengzhou Cai, Yaorui Shi, Qi Gu, Hui Su, Xunliang Cai, Xiang Wang, An Zhang, Tat-Seng Chua
TL;DR
The paper tackles the persistent asymmetry in which language models excel at generating reasoning traces but struggle to verify their own outputs. It demonstrates that improving self-verification can enhance generation performance, and introduces a self-verification training pipeline within a reinforcement-learning-with-verifiable-rewards framework. Building on this, it presents a decoupled multi-task RL approach that combines generation and verification through stage-wise initialization or alternating training, achieving gains across multiple math benchmarks and model sizes. The findings suggest that verification signals can serve as a powerful, independent training objective that improves efficiency (fewer tokens) and enables test-time scaling, with implications for more reliable, scalable reasoning systems.
Abstract
Recent large language models (LLMs) achieve strong performance in generating promising reasoning paths for complex tasks. However, despite this strong generation ability, LLMs remain weak at verifying their own answers, revealing a persistent capability asymmetry between generation and self-verification. In this work, we conduct an in-depth investigation of this asymmetry over the course of training and show that, even on the same task, improving generation does not lead to corresponding improvements in self-verification. Interestingly, we find that the reverse direction behaves differently: learning to self-verify can effectively improve generation performance, achieving accuracy comparable to standard generation training while yielding more efficient and effective reasoning traces. Building on this observation, we further explore integrating self-verification into generation training by formulating a multi-task reinforcement learning framework in which generation and self-verification are optimized as two independent but complementary objectives. Extensive experiments across benchmarks and models demonstrate gains over generation-only training in both generation and verification capabilities.
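To make the two objectives concrete, the following is a minimal sketch of how verifiable rewards for generation and self-verification might be defined, together with a simple alternating task schedule. The abstract does not give the reward definitions or the schedule, so everything here is an assumption for illustration: the binary exact-match rewards, the function names (`generation_reward`, `verification_reward`, `alternating_schedule`), and the `period` parameter are hypothetical and are not taken from the paper.

```python
from typing import List


def generation_reward(predicted_answer: str, reference_answer: str) -> float:
    """Assumed binary reward: 1.0 if the generated final answer matches the reference."""
    return 1.0 if predicted_answer.strip() == reference_answer.strip() else 0.0


def verification_reward(verdict: bool, candidate_answer: str, reference_answer: str) -> float:
    """Assumed binary reward: 1.0 if the model's verdict on a candidate answer
    agrees with whether that candidate actually matches the reference."""
    candidate_is_correct = candidate_answer.strip() == reference_answer.strip()
    return 1.0 if verdict == candidate_is_correct else 0.0


def alternating_schedule(num_steps: int, period: int = 1) -> List[str]:
    """Hypothetical schedule: switch between the two tasks every `period` steps,
    starting with generation."""
    if period < 1:
        raise ValueError("period must be at least 1")
    return [
        "generation" if (step // period) % 2 == 0 else "verification"
        for step in range(num_steps)
    ]


if __name__ == "__main__":
    print(generation_reward("42", "42"))
    print(verification_reward(False, "41", "42"))
    print(alternating_schedule(4, period=2))
```

In this sketch the verification reward depends only on whether the verdict is right, not on whether the candidate answer is right, which is one way to keep the two objectives independent as the abstract describes. A real pipeline would also need answer extraction and a more tolerant equivalence check than exact string matching.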
