On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization

Wenlong Deng, Yi Ren, Muchen Li, Danica J. Sutherland, Xiaoxiao Li, Christos Thrampoulidis

TL;DR

The paper analyzes why Group Relative Policy Optimization (GRPO) exhibits Lazy Likelihood Displacement (LLD) during RL fine-tuning of LLMs for reasoning tasks, linking negative gradients to suboptimal updates on correct responses. It introduces Negative Token Hidden Reward (NTHR), a token-level penalty modulation that selectively downweights negative gradients on tokens in incorrect responses that strongly suppress correct-response likelihoods, exploiting GRPO’s group structure. The authors show that NTHR mitigates LLD and yields consistent gains on math reasoning benchmarks across 0.5B–3B parameter models, with robust ablations on penalty thresholds and scaling factors. They provide theoretical underpinnings via GRPO-as-group-preference proofs and a Group Weighted Hidden Embedding Score (GWHES) metric to identify LLD-susceptible samples, along with practical implementation details and complexity considerations. The work enhances data efficiency and robustness of GRPO-based RL for reasoning tasks, offering a principled approach to token-level penalty modulation that could generalize to related preference-based learning frameworks.

Abstract

Reinforcement learning (RL) has become popular in enhancing the reasoning capabilities of large language models (LLMs), with Group Relative Policy Optimization (GRPO) emerging as a widely used algorithm in recent systems. Despite GRPO's widespread adoption, we identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses marginally increases or even decreases during training. This behavior mirrors a recently discovered misalignment issue in Direct Preference Optimization (DPO), attributed to the influence of negative gradients. We provide a theoretical analysis of GRPO's learning dynamic, identifying the source of LLD as the naive penalization of all tokens in incorrect responses with the same strength. To address this, we develop a method called NTHR, which downweights penalties on tokens contributing to the LLD. Unlike prior DPO-based approaches, NTHR takes advantage of GRPO's group-based structure, using correct responses as anchors to identify influential tokens. Experiments on math reasoning benchmarks demonstrate that NTHR effectively mitigates LLD, yielding consistent performance gains across models ranging from 0.5B to 3B parameters.
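The token-level downweighting idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the per-token `influence_scores` are a hypothetical stand-in for whatever score NTHR assigns (its computation is not given in this summary), and `threshold` and `scale` are illustrative names for the penalty threshold and scaling factor mentioned in the TL;DR.

```python
from typing import List


def downweight_negative_token_advantages(
    advantages: List[float],
    influence_scores: List[float],
    threshold: float,
    scale: float,
) -> List[float]:
    """Shrink the penalty on flagged tokens of an incorrect response.

    A token is flagged when its advantage is negative (it is being penalized)
    and its influence score exceeds `threshold`. Flagged tokens have their
    advantage multiplied by `scale` (0 <= scale <= 1); all others are unchanged.
    The influence scores are an assumed input, not the paper's exact NTHR formula.
    """
    if len(advantages) != len(influence_scores):
        raise ValueError("advantages and influence_scores must have equal length")
    if not 0.0 <= scale <= 1.0:
        raise ValueError("scale must lie in [0, 1]")

    adjusted = []
    for adv, score in zip(advantages, influence_scores):
        flagged = adv < 0 and score > threshold
        adjusted.append(adv * scale if flagged else adv)
    return adjusted


# Four tokens of one incorrect response, each penalized with advantage -1.0.
# Tokens 1 and 3 have high (hypothetical) influence scores, so their penalty shrinks.
print(downweight_negative_token_advantages(
    [-1.0, -1.0, -1.0, -1.0], [0.1, 0.9, 0.2, 0.7], threshold=0.5, scale=0.25
))  # [-1.0, -0.25, -1.0, -0.25]
```

Tokens of correct responses carry non-negative advantages in this sketch and are never modified, which matches the description of NTHR as acting only on penalties applied to incorrect responses.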

Paper Structure

This paper contains 24 sections, 3 theorems, 25 equations, 6 figures, 10 tables, 1 algorithm.

Key Result

Lemma 4.2

When reward is binary, GRPO performs preference optimization between two distinct groups: the group of correct responses ($r_i = 1$) and the group of incorrect responses ($r_i = 0$). Specifically, the optimization objective reduces to the following: where $p\overset{\Delta}{=} p(\bm{x}) \overset{\Delta}{=} \frac{1}{G}\sum_{i\in[G]}\mathbbm{1}[r_i(x)=1]$ denotes the correctness rate for a given in
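To make the two-group view concrete, here is a minimal sketch of GRPO-style group-normalized advantages under binary rewards. It assumes the common normalization (reward minus group mean, divided by the group's population standard deviation, with no epsilon term); the function name `binary_group_advantages` is illustrative and not taken from the paper. Under that assumption, every correct response receives the same advantage $\sqrt{(1-p)/p}$ and every incorrect response receives $-\sqrt{p/(1-p)}$, so the update only distinguishes the two groups.

```python
import math
from typing import List


def binary_group_advantages(rewards: List[int]) -> List[float]:
    """Group-normalized advantages for one prompt with 0/1 rewards.

    Assumes advantage_i = (r_i - mean) / std with the population standard
    deviation. With binary rewards, mean = p and std = sqrt(p * (1 - p)),
    where p is the fraction of correct responses in the group.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    group_size = len(rewards)
    p = sum(rewards) / group_size  # correctness rate for this prompt
    if p in (0.0, 1.0):
        # All responses share the same reward, so there is no learning signal.
        return [0.0] * group_size
    std = math.sqrt(p * (1.0 - p))
    return [(r - p) / std for r in rewards]


# One correct response out of four: p = 0.25.
adv = binary_group_advantages([1, 0, 0, 0])
print(adv[0], math.sqrt((1 - 0.25) / 0.25))   # correct response vs. closed form
print(adv[1], -math.sqrt(0.25 / (1 - 0.25)))  # incorrect response vs. closed form
```

Note how a low correctness rate gives the few correct responses a large positive advantage while spreading a smaller penalty over the many incorrect ones.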

Figures (6)

  • Figure 1: We show that negative gradients can lead to small or reduced likelihood change of positive samples in GRPO. The log-likelihood gains achieved by Pos Only training (orange) are significantly higher than those from GRPO (blue) for Qwen-0.5B-Ins (a) and Deepseek-1.5B (b). In Qwen-Math-1.5B (c), samples with small or reduced $\Delta(\bm{x})$ (left) are primarily influenced by negative gradients, as evidenced by their larger $\Delta(\bm{x})$ in the Pos Only setup. However, some samples on the right show smaller $\Delta(\bm{x})$ than in GRPO, indicating that negative gradients are not always harmful.
  • Figure 2: Inspecting negative (incorrect) samples of questions with small average likelihood change $\Delta(\bm{x})$ (defined in the paper's likelihood-change equation) reveals that they are either nearly correct (Left) or reach the correct answer in a wrong answer format (Right). Thus, penalizing entire negative responses might be suboptimal. Red dashed lines denote omitted reasoning steps.
  • Figure 3: Key insight: Tokens of negative samples (incorrect responses) can be logically correct or correct at the step level. High NTHR values tend to correlate strongly with such tokens (highlighted in red). The bold dots represent omitted reasoning.
  • Figure 4: GRPO$+$NTHR consistently improves likelihood change of correct responses, as indicated by the green bars exceeding the blue bars. While GRPO$+$Random offers only modest improvements, GRPO$+$NTHR consistently outperforms it, highlighting the effectiveness of NTHR in identifying LLD tokens.
  • Figure 5: Performance across training iterations for various models. NTHR outperforms GRPO for most of the training process.
  • ...and 1 more figure

Theorems & Definitions (4)

  • Definition 4.1
  • Lemma 4.2
  • Theorem 4.4
  • Corollary 4.5