ScRPO: From Errors to Insights

Lianrui Li, Dakuan Lu, Jiawei Shao, Xuelong Li

TL;DR

ScRPO tackles the challenge of mathematical reasoning in large language models by integrating self-reflection and error correction into reinforcement learning. It introduces a two-stage framework: a trial-and-error GRPO-based stage that collects missteps into an error pool, and a periodic self-correction stage that analyzes errors to generate improved solutions, with rewards anchored to reflection tokens. Empirical results across GSM8k, MATH-500, AIME-2024, AMC, and Olympiad show that ScRPO consistently outperforms SFT, DPO, GRPO, and DAPO on both 1.5B and 7B models, with notable gains on challenging benchmarks like AIME. The work demonstrates the value of error-driven learning and reflection as scalable, feedback-efficient enhancements to mathematical reasoning in LLMs, advancing the reliability and capability of AI systems.

Abstract

We propose Self-correction Relative Policy Optimization (ScRPO), a novel reinforcement learning framework designed to enhance large language models on challenging mathematical problems by leveraging self-reflection and error correction. Our approach consists of two stages: (1) Trial-and-error learning stage: training the model with GRPO and collecting incorrect answers along with their corresponding questions in an error pool; (2) Self-correction learning stage: guiding the model to reflect on why its previous answers were wrong. We conduct extensive experiments across multiple math reasoning benchmarks, including AIME, AMC, Olympiad, MATH-500, and GSM8k, using Deepseek-Distill-Qwen-1.5B and Deepseek-Distill-Qwen-7B. The experimental results demonstrate that ScRPO consistently outperforms several post-training methods. These findings highlight ScRPO as a promising paradigm for enabling language models to self-improve on difficult tasks with limited external feedback, paving the way toward more reliable and capable AI systems.
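The two stages described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `ErrorPool`, `trial_and_error_step`, `self_correction_step`, the exact-match reward, and the choice to empty the pool after each self-correction pass are all assumptions made for illustration. The advantage computation follows the commonly used GRPO form (reward minus group mean, divided by group standard deviation), and the policy update itself is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ErrorPool:
    """Hypothetical container for (question, wrong_answer) pairs."""
    items: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, question: str, wrong_answer: str) -> None:
        self.items.append((question, wrong_answer))

    def drain(self) -> List[Tuple[str, str]]:
        # Assumption: the pool is emptied once its errors have been revisited.
        drained, self.items = self.items, []
        return drained


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    if std == 0.0:
        # All samples scored the same, so the group carries no learning signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def trial_and_error_step(
    question: str,
    reference: str,
    sample: Callable[[str], str],
    pool: ErrorPool,
    group_size: int = 4,
) -> List[float]:
    """Stage 1: sample a group of answers, score them, store the wrong ones."""
    answers = [sample(question) for _ in range(group_size)]
    # Assumption: a binary exact-match reward against the reference answer.
    rewards = [1.0 if a == reference else 0.0 for a in answers]
    for answer, reward in zip(answers, rewards):
        if reward == 0.0:
            pool.add(question, answer)
    return group_relative_advantages(rewards)


def self_correction_step(
    pool: ErrorPool,
    references: Dict[str, str],
    reflect: Callable[[str, str], str],
) -> List[Tuple[str, float]]:
    """Stage 2: revisit stored errors; reward only corrected final answers."""
    results = []
    for question, wrong_answer in pool.drain():
        corrected = reflect(question, wrong_answer)
        reward = 1.0 if corrected == references[question] else 0.0
        results.append((question, reward))
    return results
```

In a training loop, `trial_and_error_step` would run on every batch while `self_correction_step` would run only periodically, with `sample` and `reflect` standing in for calls to the policy model.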

Paper Structure

This paper contains 19 sections, 14 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Illustration of the motivation for ScRPO. (a) How humans learn mathematics. When learning to solve math problems, humans not only need to keep working on new problems; reflecting on and summarizing previously incorrect problems is also a critical step. (b) When an LLM solves a math problem, it may encounter failure. If it works on its errors, its capability expands.
  • Figure 2: An illustration of the proposed ScRPO method. ScRPO consists of continuous trial-and-error learning that collects incorrect answers, followed by periodic self-correction learning, where the model reflects on errors and generates improved responses.
  • Figure 3: Prompt template for the self-correction learning stage. The model is instructed to analyze its previous incorrect attempt and generate a corrected solution. The reflection and corrected answer are trained through RL when the final output is correct.
  • Figure 4: Average scores of Deepseek-Distill-Qwen-1.5B and Deepseek-Distill-Qwen-7B ablation experiments across various benchmarks
  • Figure 5: Ablation study results of Deepseek-Distill-Qwen-1.5B and Deepseek-Distill-Qwen-7B on various benchmarks