
From $\boldsymbol{\log\pi}$ to $\boldsymbol{\pi}$: Taming Divergence in Soft Clipping via Bilateral Decoupled Decay of Probability Gradient Weight

Xiaoliang Fu, Jiaye Lin, Yangyi Fang, Chaowen Hu, Cong Qin, Zekai Shao, Binbin Zheng, Lu Pan, Ke Zeng

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed a leap in Large Language Model (LLM) reasoning, yet its optimization dynamics remain fragile. Standard algorithms like GRPO enforce stability via "hard clipping", which inadvertently stifles exploration by discarding the gradients of tokens outside the trust region. While recent "soft clipping" methods attempt to recover these gradients, they suffer from a critical challenge: relying on the log-probability gradient ($\nabla_\theta \log \pi_\theta$) yields divergent weights as probabilities vanish, destabilizing LLM training. We rethink this convention by establishing the probability gradient ($\nabla_\theta \pi_\theta$) as the superior optimization primitive. Accordingly, we propose Decoupled Gradient Policy Optimization (DGPO), which employs a decoupled decay mechanism based on importance sampling ratios. By applying asymmetric, continuous decay to boundary tokens, DGPO resolves the conflict between stability and sustained exploration. Extensive experiments across DeepSeek-R1-Distill-Qwen series models (1.5B/7B/14B) demonstrate that DGPO consistently outperforms strong baselines on various mathematical benchmarks, offering a robust and scalable solution for RLVR. Our code and implementation are available at: https://github.com/VenomRose-Juri/DGPO-RL.
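The two failure modes named in the abstract can be illustrated with a small numerical sketch. It is not the DGPO implementation: the function names, the clip range `eps = 0.2`, and the constant weight `c = 0.8` are hypothetical values chosen for illustration. The first function encodes the standard clipped-surrogate rule (the gradient vanishes once the clipped branch is active), and the second uses the identity $\nabla_\theta \log \pi_\theta = \nabla_\theta \pi_\theta / \pi_\theta$ to show how a bounded weight on the log-probability gradient translates into an unbounded weight on the probability gradient as $\pi_\theta \to 0$.

```python
def hard_clip_keeps_gradient(ratio: float, advantage: float, eps: float = 0.2) -> bool:
    """Return True if a PPO/GRPO-style clipped surrogate passes a gradient.

    The surrogate is min(r * A, clip(r, 1 - eps, 1 + eps) * A); when the
    clipped branch is selected, the token contributes no gradient.
    eps = 0.2 is an assumed illustrative value.
    """
    if advantage > 0 and ratio > 1 + eps:
        return False  # right boundary: positive-advantage token is discarded
    if advantage < 0 and ratio < 1 - eps:
        return False  # left boundary: negative-advantage token is discarded
    return True


def prob_gradient_weight(c: float, pi: float) -> float:
    """Weight on grad(pi) implied by a weight c on grad(log pi).

    Follows from grad(log pi) = grad(pi) / pi, so c * grad(log pi)
    equals (c / pi) * grad(pi).
    """
    if pi <= 0.0:
        raise ValueError("pi must be a positive probability")
    return c / pi


# Hard clipping: a negative-advantage token past the left boundary gets no gradient.
kept_inside = hard_clip_keeps_gradient(ratio=0.95, advantage=-1.0)
kept_outside = hard_clip_keeps_gradient(ratio=0.50, advantage=-1.0)

# Soft clipping with a constant (hypothetical) weight c on grad(log pi):
# the implied weight on grad(pi) grows without bound as pi shrinks.
c = 0.8
probs = [1e-1, 1e-3, 1e-6]
weights = [prob_gradient_weight(c, p) for p in probs]
for p, w in zip(probs, weights):
    print(f"pi={p:.0e}  weight on grad(pi)={w:.3e}")
```

Read this way, hard clipping avoids the blow-up only by discarding the token entirely, while a soft rule that keeps a constant weight on $\nabla_\theta \log \pi_\theta$ recovers the gradient at the cost of a $1/\pi_\theta$ factor. This is the tension the abstract attributes to prior methods.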

Paper Structure (77 sections, 50 equations, 5 figures, 12 tables)


Figures (5)

  • Figure 1: Schematic overview of our DGPO algorithm. While "hard clipping" methods (GRPO/ASPO) discard gradients at boundaries, and prior "soft clipping" approaches (CISPO/GPPO/CE-GPPO) risk divergence on the left boundary and limit exploration on the right boundary, DGPO optimizes the gradient decay mechanism accordingly. By enforcing a controlled "Slow Down" behavior for stability and a "Slow Down Gently" behavior to sustain exploration, DGPO effectively resolves the exploration-stability conflict while maintaining minimal bias against the true policy gradient.
  • Figure 2: Comparative analysis of gradient dynamics. We systematically contrast DGPO with the standard GRPO, prior "soft clipping" enhancements (CISPO, GPPO, and CE-GPPO), and importance sampling improvements (ASPO). The visualization highlights the theoretical properties regarding the exploration capability of clipped tokens and the alignment with the true policy gradient, demonstrating DGPO's superior stability and gradient consistency.
  • Figure 3: Comprehensive comparison of training dynamics and performance. Top row: DeepSeek-R1-Distill-Qwen-1.5B results. Bottom row: DeepSeek-R1-Distill-Qwen-7B results. Columns (left to right): Pass@K on AIME25, Avg@32 on AIME24/25, policy entropy, followed by hyperparameter analysis for Avg@32 and entropy.
  • Figure 4: Comparison of gradient weight distributions (Prob: probability, Grad: gradient). (a) Overall distribution. (b) Detailed scatter plots of gradient weight vs. importance sampling (IS) ratio (top row) and vs. probability (bottom row) across three methods: GRPO (left), GPPO (middle), and DGPO (right). Points are colored by advantage. (c) Boundary distribution analysis.
  • Figure 5: Training dynamics of DeepSeek-R1-Distill-Qwen-14B comparing GRPO and DGPO.