EDGE-GRPO: Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity

Xingjian Zhang, Siwei Wen, Wenjun Wu, Lei Huang

TL;DR

This work tackles the advantage collapse problem in Group Relative Policy Optimization (GRPO) by analyzing the limitations of model reflection and examining policy entropy at the fine-grained sample level. It introduces EDGE-GRPO, which combines Guided Error Correction (GEC) to diversify responses with Entropy-Driven Advantage (EDA) to diversify learning signals, thereby improving gradient updates under sparse rewards. Across multiple math-reasoning benchmarks, EDGE-GRPO delivers substantial, data-efficient gains with only 1K training samples, outperforming vanilla GRPO and several baselines and approaching or surpassing larger, data-hungry models. The results demonstrate the importance of fine-grained entropy-based guidance and external corrections in stabilizing training and enhancing reasoning performance. This approach offers a practical path to more efficient, robust reasoning models in settings with sparse feedback.

Abstract

Large Language Models (LLMs) have made remarkable progress in enhancing step-by-step reasoning through reinforcement learning. However, the Group Relative Policy Optimization (GRPO) algorithm, which relies on sparse reward rules, often encounters the issue of identical rewards within groups, leading to the advantage collapse problem. Existing works typically address this challenge from two perspectives: enforcing model reflection to enhance response diversity, and introducing internal feedback to augment the training signal (advantage). In this work, we begin by analyzing the limitations of model reflection and investigating the policy entropy of responses at the fine-grained sample level. Based on our experimental findings, we propose the EDGE-GRPO algorithm, which adopts Entropy-Driven Advantage and Guided Error Correction to effectively mitigate the problem of advantage collapse. Extensive experiments on several main reasoning benchmarks demonstrate the effectiveness and superiority of our approach. It is available at https://github.com/ZhangXJ199/EDGE-GRPO.
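
To make the advantage collapse problem concrete, the following minimal sketch standardizes rewards within one group of sampled responses, as GRPO-style methods commonly do. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for illustration; they are not taken from the paper or its repository.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    The eps term is a hypothetical stabilizer that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes: correct responses get positive advantages, incorrect ones negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Identical rewards (here, every response is wrong): every advantage is zero,
# so the group contributes no learning signal.
collapsed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(collapsed)
```

When every response in a group receives the same sparse reward, each reward equals the group mean and all advantages vanish. This is the situation that diversifying responses and diversifying the advantage signal are each meant to avoid.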

Paper Structure

This paper contains 26 sections, 8 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Performance comparison with other open-source models on Olympiad and Minerva. Our method achieves competitive and excellent performance with only 1K training samples. These models are all post-trained based on Qwen2.5-Math-7B.
  • Figure 2: The reflection performance of different models. Upper: For most models, the accuracy of responses that involve self-reflection is significantly lower than the overall accuracy. Left: Fine-tuning with high-quality data that includes reflection processes helps improve the accuracy of model reflection. Right: Even when the model is forced to reflect on incorrect responses, the improvement in accuracy remains limited. These results are averaged over four types of reflection prompts.
  • Figure 3: Left: The relative confidence of different models in correct responses under various temperature settings. The area of the blue squares serves as a proxy for the model’s relative confidence, with larger areas reflecting greater confidence in correct responses. Right: The proportion of correct responses with entropy higher than the average and incorrect responses with entropy lower than the average across different models. These results are evaluated under the setting of temperature=0.1. We provide more detailed experimental results and the policy entropy distribution of different models in the Appendix.
  • Figure 4: The overall framework of the EDGE-GRPO algorithm. By introducing Guided Error Correction at the response level to enhance response diversity and Entropy-Driven Advantage at the signal level to increase advantage diversity, we mitigate the advantage collapse problem in vanilla GRPO.
  • Figure 5: The changes in intra-group advantage variance during training for different methods. Our method maintains a relatively high level without significant decline.
  • ...and 2 more figures
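
The sample-level entropy statistic described in the Figure 3 caption can be sketched as follows. This is an illustrative reading only: it assumes a response's policy entropy is the mean Shannon entropy of its per-token distributions, that "the average" is taken over all responses passed in, and that each proportion is computed within its own class (correct or incorrect). The function names are hypothetical and the paper's exact definitions may differ.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def response_entropy(token_dists):
    """Assumed sample-level entropy: mean token entropy over the response."""
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)


def contrary_proportions(entropies, correct):
    """Fraction of correct responses above the average entropy, and
    fraction of incorrect responses below it (assumed reading of Figure 3)."""
    avg = sum(entropies) / len(entropies)
    correct_h = [h for h, c in zip(entropies, correct) if c]
    wrong_h = [h for h, c in zip(entropies, correct) if not c]
    high_correct = sum(h > avg for h in correct_h) / len(correct_h) if correct_h else 0.0
    low_wrong = sum(h < avg for h in wrong_h) / len(wrong_h) if wrong_h else 0.0
    return high_correct, low_wrong


# Toy data (made up for illustration): four responses with their entropies and labels.
entropies = [0.2, 0.8, 0.4, 0.6]
correct = [True, True, False, False]
high_correct, low_wrong = contrary_proportions(entropies, correct)
print(high_correct, low_wrong)
```

Under this reading, large values for either proportion would indicate that entropy alone does not cleanly separate correct from incorrect responses.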