On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization
Wenlong Deng, Yi Ren, Muchen Li, Danica J. Sutherland, Xiaoxiao Li, Christos Thrampoulidis
TL;DR
The paper analyzes why Group Relative Policy Optimization (GRPO) exhibits Lazy Likelihood Displacement (LLD) during RL fine-tuning of LLMs for reasoning tasks: the likelihood of correct responses rises only marginally, or even falls, and the analysis traces this to the negative gradients applied to incorrect responses. It introduces Negative Token Hidden Reward (NTHR), a token-level penalty modulation that exploits GRPO's group structure to selectively downweight negative gradients on those tokens in incorrect responses that most strongly suppress correct-response likelihoods. The authors show that NTHR mitigates LLD and yields consistent gains on math reasoning benchmarks for models from 0.5B to 3B parameters, with ablations over the penalty threshold and scaling factor. The theoretical support comes from proofs that treat GRPO as a group-preference objective and from a Group Weighted Hidden Embedding Score (GWHES) metric that identifies LLD-susceptible samples; the paper also covers implementation details and computational complexity. The work aims to improve the data efficiency and robustness of GRPO-based RL for reasoning tasks, and its token-level penalty modulation could generalize to related preference-based learning frameworks.
Abstract
Reinforcement learning (RL) has become popular for enhancing the reasoning capabilities of large language models (LLMs), with Group Relative Policy Optimization (GRPO) emerging as a widely used algorithm in recent systems. Despite GRPO's widespread adoption, we identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses increases only marginally, or even decreases, during training. This behavior mirrors a recently discovered misalignment issue in Direct Preference Optimization (DPO), attributed to the influence of negative gradients. We provide a theoretical analysis of GRPO's learning dynamics, identifying the source of LLD as the naive penalization of all tokens in incorrect responses with the same strength. To address this, we develop a method called Negative Token Hidden Reward (NTHR), which downweights penalties on tokens contributing to LLD. Unlike prior DPO-based approaches, NTHR takes advantage of GRPO's group-based structure, using correct responses as anchors to identify influential tokens. Experiments on math reasoning benchmarks demonstrate that NTHR effectively mitigates LLD, yielding consistent performance gains across models ranging from 0.5B to 3B parameters.
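The contrast between uniform and selective penalization can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes binary correctness rewards standardized within a group (population standard deviation is an assumption; implementations differ), and it uses hypothetical per-token influence scores with an illustrative threshold `tau` and scale `eta`. The function names and the scores themselves are invented for illustration, and the paper's actual NTHR score is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_penalty_weights(advantage, influence_scores, tau, eta):
    """Per-token multipliers for one response (hypothetical helper).

    Responses with non-negative advantage keep weight 1.0 on every token.
    For a negative-advantage response, tokens whose illustrative influence
    score exceeds tau have their penalty scaled by eta (with eta < 1).
    """
    if advantage >= 0:
        return [1.0] * len(influence_scores)
    return [eta if s > tau else 1.0 for s in influence_scores]


# One group of four sampled responses: 1.0 = correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Made-up influence scores for the three tokens of the second response.
scores = [0.1, 0.9, 0.4]
weights = token_penalty_weights(advantages[1], scores, tau=0.5, eta=0.2)

# Uniform penalization would apply advantages[1] to every token;
# the modulated version softens it only on the flagged token.
token_advantages = [advantages[1] * w for w in weights]
```

With these toy numbers, only the middle token of the incorrect response receives a reduced penalty, while the other tokens keep the full negative advantage. Setting `eta` to 1.0 gives back uniform penalization.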
