CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning

Zhenpeng Su, Leiyu Pan, Minxuan Lv, Yuntao Li, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou

TL;DR

The paper analyzes entropy dynamics in reinforcement learning for LLM fine-tuning and identifies gradients from clipped tokens as key regulators of exploration versus exploitation. It introduces CE-GPPO, a gradient-preserving clipping policy optimization algorithm that reintroduces and scales gradients from out-of-clip tokens using a stop-gradient design and two tunable factors, $\beta_1$ and $\beta_2$. The approach is theoretically justified and empirically validated on mathematical reasoning benchmarks, showing improved entropy stability and performance across model scales. CE-GPPO consistently outperforms strong baselines (GRPO, DAPO, CISPO, GSPO) and demonstrates robustness to hyperparameters, indicating practical impact for scalable, stable RL-based LLM fine-tuning.

Abstract

Reinforcement learning (RL) has become a powerful paradigm for optimizing large language models (LLMs) to handle complex reasoning tasks. A core challenge in this process lies in managing policy entropy, which reflects the balance between exploration and exploitation during training. Existing methods, such as proximal policy optimization (PPO) and its variants, discard valuable gradient signals from low-probability tokens due to the clipping mechanism. We systematically analyze the entropy dynamics and reveal that these clipped tokens play a critical yet overlooked role in regulating entropy evolution. We propose Coordinating Entropy via Gradient-Preserving Policy Optimization (CE-GPPO), a novel algorithm that reintroduces gradients from clipped tokens in native PPO in a gentle and bounded manner. By controlling the magnitude of gradients from tokens outside the clipping interval, CE-GPPO is able to achieve an exploration-exploitation trade-off. We provide theoretical justification and empirical evidence showing that CE-GPPO effectively mitigates entropy instability. Extensive experiments on mathematical reasoning benchmarks show that CE-GPPO consistently outperforms strong baselines across different model scales.
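
To make the clipping behaviour described in the abstract concrete, the following is a minimal sketch of the per-token gradient coefficient, that is, the scalar multiplying $\nabla_\theta \log \pi_\theta$ for one token. It compares the standard PPO clipped surrogate with a gradient-preserving variant. Several things here are assumptions and are not stated in the text above: the function names, the default clip range `eps=0.2`, the form of the stop-gradient term (bound divided by the detached ratio, times the live ratio), and the mapping of `beta1` to the lower clip side and `beta2` to the upper side. It should be read as an illustration of the idea, not as the authors' implementation.

```python
def ppo_clip_grad_coeff(ratio, advantage, eps=0.2):
    """Coefficient on grad log pi under the standard PPO clipped surrogate.

    The surrogate is min(r * A, clip(r, 1 - eps, 1 + eps) * A). When the
    clipped branch is selected it is constant in theta, so the token
    contributes no gradient at all.
    """
    clipped_high = advantage > 0 and ratio > 1 + eps
    clipped_low = advantage < 0 and ratio < 1 - eps
    if clipped_high or clipped_low:
        return 0.0
    return ratio * advantage


def gp_clip_grad_coeff(ratio, advantage, eps=0.2, beta1=1.0, beta2=1.0):
    """Coefficient on grad log pi under a gradient-preserving clip (sketch).

    ASSUMPTION: an out-of-clip token uses the term
        beta * (bound / stop_grad(r)) * r * A,
    whose gradient is beta * bound * A * grad log pi. The coefficient is
    therefore bounded no matter how far r lies outside the interval.
    ASSUMPTION: beta1 scales the lower side, beta2 the upper side.
    """
    if advantage < 0 and ratio < 1 - eps:
        return beta1 * (1 - eps) * advantage
    if advantage > 0 and ratio > 1 + eps:
        return beta2 * (1 + eps) * advantage
    return ratio * advantage


if __name__ == "__main__":
    # One token inside the clip range, one above it, one below it.
    for r, a in [(1.1, 2.0), (1.5, 1.0), (0.5, -1.0)]:
        print(r, a, ppo_clip_grad_coeff(r, a),
              gp_clip_grad_coeff(r, a, beta1=0.75, beta2=0.5))
```

Under these assumptions, tokens inside the clipping interval behave identically in both functions, while out-of-clip tokens move from a zero coefficient to a bounded, tunable one. Setting both factors to zero recovers the standard clipped behaviour.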

Paper Structure

This paper contains 42 sections, 32 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Left: Importance sampling distribution of tokens with different probabilities. Based on the distribution, all tokens can be categorized into four types: PA&HP, NA&LP, PA&LP and NA&HP. Center: The effect of the four token types on entropy dynamics. The two categories shown at the top contribute to entropy reduction, while those at the bottom contribute to entropy increase. Green check marks indicate tokens that lie within the clipping interval, whereas dashed circles denote tokens that partly fall outside the clipping interval. Right: Entropy instability curves caused by the absence of some PA&LP or NA&LP tokens.
  • Figure 2: Based on DeepSeek-R1-Distill-Qwen-7B, a comparison of GRPO, DAPO, and GPPO in terms of entropy dynamics and AIME25 benchmark accuracy.
  • Figure 3: Entropy dynamics and benchmark accuracy under different $\beta_1/\beta_2$ configurations.
  • Figure 4: Comparison of KL divergence and gradient norm dynamics between GRPO and CE-GPPO.
  • Figure 5: Comparison of CE-GPPO with other entropy collapse mitigation strategies. Native GRPO denotes the baseline without any mitigation strategy. $\alpha = 0.001/0.003$ indicates the addition of an entropy loss term to the Native GRPO baseline, where $\alpha$ represents the entropy loss coefficient. DAPO refers to applying the Clip Higher strategy to the Native GRPO baseline.
  • ...and 1 more figure