S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning

Ruotian Ma, Peisong Wang, Cheng Liu, Xingyan Liu, Jiaqi Chen, Bang Zhang, Xin Zhou, Nan Du, Jia Li

TL;DR

S$^2$R is an efficient two-stage framework that strengthens LLM reasoning by teaching models to self-verify and self-correct during inference. It first initializes these behaviors through supervised fine-tuning on dynamic trial-and-error trajectories, then reinforces them with both outcome-level and process-level reinforcement learning, including an offline RL variant. Across three base models and seven math benchmarks, S$^2$R delivers substantial gains from little initialization data (3.1k samples for Qwen2.5-Math-7B), and it also generalizes to non-math tasks. The work further analyzes when process-level versus outcome-level RL is more beneficial and highlights data-efficient strategies for improving deep thinking in smaller LLMs. Overall, S$^2$R offers a practical, data-efficient path to more robust reasoning.

Abstract

Recent studies have demonstrated the effectiveness of LLM test-time scaling. However, existing approaches to incentivize LLMs' deep thinking abilities generally require large-scale data or significant training efforts. Meanwhile, it remains unclear how to improve the thinking abilities of less powerful base models. In this work, we introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Specifically, we first initialize LLMs with iterative self-verification and self-correction behaviors through supervised fine-tuning on carefully curated data. The self-verification and self-correction skills are then further strengthened by both outcome-level and process-level reinforcement learning, with minimized resource requirements, enabling the model to adaptively refine its reasoning process during inference. Our results demonstrate that, with only 3.1k self-verifying and self-correcting behavior initialization samples, Qwen2.5-Math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data. Extensive experiments and analysis based on three base models across both in-domain and out-of-domain benchmarks validate the effectiveness of S$^2$R. Our code and data are available at https://github.com/NineAbyss/S2R.

Paper Structure (70 sections, 17 equations, 7 figures, 10 tables)

Figures (7)

  • Figure 1: The data efficiency of S$^2$R compared to competitive methods, with all models initialized from Qwen2.5-Math-7B.
  • Figure 2: Overview of S$^2$R.
  • Figure 3: Evaluation on verification and correction.
  • Figure 4: The accuracy and average trial number of different models across difficulty levels. Evaluated on MATH500 test set.
  • Figure 5: SFT data example.
  • ...and 2 more figures