DaGRPO: Rectifying Gradient Conflict in Reasoning via Distinctiveness-Aware Group Relative Policy Optimization
Xuan Xie, Xuan Wang, Wenjie Wang, Shuai Chen, Wei Lin
TL;DR
DaGRPO tackles gradient instability in Group Relative Policy Optimization (GRPO) with two distinctiveness-aware mechanisms: sequence-level gradient rectification, which filters out low-distinctiveness samples, and off-policy data augmentation, which supplies high-quality reasoning anchors. The approach stabilizes training, accelerates the emergence of long-chain reasoning, and yields state-of-the-art performance on 9 math and OOD benchmarks, including a +4.7% average gain over GRPO on in-distribution math tasks. Ablations show that the gradient-rectification component improves performance even without off-policy data, while the anchors significantly aid hard tasks and generalization. Limitations include the computational cost of LLM-based scoring and the dependence on external expert data for anchors, which point to future work on efficiency and self-bootstrapping methods.
Abstract
The evolution of Large Language Models (LLMs) has catalyzed a paradigm shift from superficial instruction following to rigorous long-horizon reasoning. Group Relative Policy Optimization (GRPO) has emerged as a pivotal post-training mechanism for eliciting such reasoning capabilities thanks to its exceptional performance, yet it remains plagued by significant training instability and poor sample efficiency. We theoretically identify the root cause of these issues as a lack of distinctiveness within on-policy rollouts: for routine queries, highly homogeneous samples induce destructive gradient conflicts, whereas for hard queries, the scarcity of valid positive samples results in ineffective optimization. To bridge this gap, we propose Distinctiveness-aware Group Relative Policy Optimization (DaGRPO). DaGRPO incorporates two core mechanisms: (1) Sequence-level Gradient Rectification, which uses fine-grained scoring to dynamically mask sample pairs with low distinctiveness, thereby eliminating gradient conflicts at the source; and (2) Off-policy Data Augmentation, which introduces high-quality anchors to recover training signals for challenging tasks. Extensive experiments across 9 mathematical reasoning and out-of-distribution (OOD) generalization benchmarks demonstrate that DaGRPO significantly surpasses existing SFT, GRPO, and hybrid baselines, achieving new state-of-the-art performance (e.g., a +4.7% average accuracy gain on math benchmarks). Furthermore, in-depth analysis confirms that DaGRPO effectively mitigates gradient explosion and accelerates the emergence of long-chain reasoning capabilities.
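As a rough illustration of the first mechanism, the sketch below shows one way that masking low-distinctiveness sample pairs could change a group-relative advantage. It is not the formulation used by DaGRPO, which the abstract does not specify. The function name `pairwise_masked_advantages`, the threshold `tau`, and the hard pairwise masking rule are all assumptions made for illustration. The sketch relies only on the identity that a mean-centered score equals the average of its pairwise gaps to the rest of the group, so dropping a pair removes that pair's contribution to the signal.

```python
def pairwise_masked_advantages(scores, tau):
    """Hypothetical sketch: mean-centered group advantage with low-gap pairs masked.

    scores: fine-grained scores for the rollouts of one query (one group).
    tau:    assumed distinctiveness threshold; pairs whose score gap is
            smaller than tau contribute nothing to the advantage.

    With tau = 0 this reduces to score_i - mean(scores), i.e. the centered
    (unnormalized) group-relative advantage.
    """
    group_size = len(scores)
    advantages = []
    for s_i in scores:
        total = 0.0
        for s_j in scores:
            gap = s_i - s_j
            # Assumed rule: keep only pairs that are distinct enough.
            if abs(gap) >= tau:
                total += gap
        advantages.append(total / group_size)
    return advantages


# A homogeneous group: every pair is masked, so all advantages are 0.0
# and these rollouts would contribute no gradient under this sketch.
homogeneous = pairwise_masked_advantages([0.50, 0.52, 0.49, 0.51], tau=0.2)

# A group with clearly separated rollouts keeps a non-zero signal.
separated = pairwise_masked_advantages([0.1, 0.9, 0.5, 0.52], tau=0.2)
```

Under these assumptions, a group of near-identical rollouts yields no update at all, while rollouts that are well separated from the others keep their signal. A real implementation would also need the scoring model and any normalization, neither of which is shown here.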
