ScRPO: From Errors to Insights
Lianrui Li, Dakuan Lu, Jiawei Shao, Xuelong Li
TL;DR
ScRPO improves mathematical reasoning in large language models by integrating self-reflection and error correction into reinforcement learning. It introduces a two-stage framework: a GRPO-based trial-and-error stage that collects incorrect answers into an error pool, and a periodic self-correction stage in which the model analyzes those errors and generates improved solutions, with rewards anchored to reflection tokens. Empirical results across GSM8k, MATH-500, AIME-2024, AMC, and Olympiad show that ScRPO consistently outperforms SFT, DPO, GRPO, and DAPO on both 1.5B and 7B models, with notable gains on challenging benchmarks like AIME. The work demonstrates the value of error-driven learning and reflection as scalable, feedback-efficient enhancements to mathematical reasoning in LLMs.
Abstract
We propose Self-correction Relative Policy Optimization (ScRPO), a novel reinforcement learning framework designed to enhance large language models on challenging mathematical problems by leveraging self-reflection and error correction. Our approach consists of two stages: (1) Trial-and-error learning stage: training the model with GRPO and collecting incorrect answers along with their corresponding questions in an error pool; (2) Self-correction learning stage: guiding the model to reflect on why its previous answers were wrong. We conduct extensive experiments on multiple math reasoning benchmarks, including AIME, AMC, Olympiad, MATH-500, and GSM8k, using Deepseek-Distill-Qwen-1.5B and Deepseek-Distill-Qwen-7B. The results show that ScRPO consistently outperforms several post-training methods. These findings highlight ScRPO as a promising paradigm for enabling language models to self-improve on difficult tasks with limited external feedback, paving the way toward more reliable and capable AI systems.
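The two stages described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the exact-match reward, the population standard deviation used to normalize rewards within a group, the `ErrorPool` class, and the reflection prompt wording are all assumptions made for illustration. The policy update itself and the reward on reflection tokens are omitted.

```python
import statistics
from dataclasses import dataclass, field


@dataclass
class ErrorPool:
    """Hypothetical container for (question, wrong_answer) pairs from stage 1."""
    items: list = field(default_factory=list)

    def add(self, question, wrong_answer):
        self.items.append((question, wrong_answer))

    def drain(self):
        # Return everything collected so far and reset the pool.
        drained, self.items = self.items, []
        return drained


def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    if std == 0:
        # All answers scored the same, so the group carries no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


def trial_and_error_step(question, reference, sampled_answers, pool):
    """Stage 1: score a group of sampled answers and pool the wrong ones."""
    # Assumption: binary exact-match reward against a reference answer.
    rewards = [1.0 if a == reference else 0.0 for a in sampled_answers]
    for answer, reward in zip(sampled_answers, rewards):
        if reward == 0.0:
            pool.add(question, answer)
    return group_relative_advantages(rewards)


def build_reflection_prompts(pool):
    """Stage 2: turn pooled errors into prompts asking for reflection and a retry."""
    # Assumption: this prompt template is illustrative, not taken from the paper.
    return [
        f"Question: {q}\nYour previous answer: {a}\n"
        "This answer was wrong. Explain why, then give a corrected solution."
        for q, a in pool.drain()
    ]


# Toy usage: four sampled answers to one question, two of them wrong.
pool = ErrorPool()
advantages = trial_and_error_step("2 + 2 = ?", "4", ["4", "5", "4", "3"], pool)
prompts = build_reflection_prompts(pool)
```

In this toy run the two wrong answers receive negative advantages and are turned into two reflection prompts, after which the pool is empty. In a real training loop the sampled answers would come from the policy model, and the self-correction stage would run periodically rather than after every step.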
