DCPO: Dynamic Clipping Policy Optimization

Shihui Yang, Chengfeng Dou, Peidong Guo, Kai Lu, Qiang Ju, Fei Deng, Rihui Xin

TL;DR

DCPO tackles two major RLVR limitations in large language models: fixed token-level clipping and reward-standardization-induced gradient instability. It introduces Dynamic-Adaptive Clipping to widen updates for low-probability tokens and Smooth Advantage Standardization to aggregate rewards across steps, plus an Only Token Mean loss to preserve per-response importance. Across four Qwen-based models and four math benchmarks, DCPO achieves state-of-the-art or competitive Avg@1 and Avg@32, with significant boosts in nonzero-advantage utilization and training efficiency, and markedly lower token clipping ratios. Ablation confirms each component’s contribution and their synergy, supporting DCPO as a robust, data-efficient RLVR method for enhancing mathematical reasoning in LLMs.

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising framework for enhancing the reasoning capabilities of large language models. However, existing approaches such as GRPO often suffer from zero gradients. This problem arises primarily from fixed clipping bounds on token-level probability ratios and from the standardization of identical rewards, both of which can lead to ineffective gradient updates and underutilization of generated responses. In this work, we propose Dynamic Clipping Policy Optimization (DCPO), which introduces a dynamic clipping strategy that adaptively adjusts clipping bounds based on token-specific prior probabilities to enhance token-level exploration, and a smooth advantage standardization technique that standardizes rewards across cumulative training steps to improve the response-level utilization of generated responses. DCPO achieves state-of-the-art performance on four benchmarks across four different models. In particular, on the AIME24 benchmark with the Qwen2.5-Math-7B model, DCPO achieves an Avg@1 of 46.7 under greedy decoding and an Avg@32 of 38.8 under 32-sample decoding, surpassing DAPO (36.7/31.6), GRPO (36.7/32.1), and GSPO (40.0/34.9). On the AIME25 benchmark with Qwen2.5-14B, DCPO achieves 23.3/19.0, surpassing GRPO (13.3/10.5), DAPO (20.0/15.3), and GSPO (16.7/9.9). Furthermore, DCPO achieves an average 28% improvement in nonzero advantages over GRPO across the four models, doubles the training efficiency of DAPO, and reduces the token clipping ratio by an order of magnitude compared to both GRPO and DAPO, while achieving superior performance. These results highlight DCPO's effectiveness in leveraging generated data more efficiently for reinforcement learning in large language models.
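The zero-gradient issue caused by standardizing identical rewards can be illustrated with a small sketch. The first function is the usual group-wise standardization used in GRPO-style training. The second is only a hypothetical stand-in for the idea of standardizing across cumulative training steps: the abstract does not give the exact formula for smooth advantage standardization, so the pooling rule, the function names, and the `eps` constant are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_standardize(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within one group of responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def cumulative_standardize(rewards, history, eps=1e-6):
    """Hypothetical variant (not the paper's formula): standardize the current
    rewards against statistics pooled over earlier steps for the same prompt."""
    pooled = list(history) + list(rewards)
    mu = mean(pooled)
    sigma = pstdev(pooled)
    return [(r - mu) / (sigma + eps) for r in rewards]


# All four responses in the current step are correct, so rewards are identical.
current = [1.0, 1.0, 1.0, 1.0]
# Rewards assumed to have been observed for the same prompt at an earlier step.
earlier = [0.0, 1.0, 0.0, 0.0]

step_adv = group_standardize(current)
cum_adv = cumulative_standardize(current, earlier)

print("per-step advantages:  ", step_adv)
print("cumulative advantages:", cum_adv)
```

With identical rewards, the per-step advantages are all zero, so those responses contribute no gradient. Pooling with earlier rewards keeps the advantages nonzero in this toy case, which is the kind of response-level utilization the abstract describes.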

Paper Structure

This paper contains 33 sections, 30 equations, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: Token clipping ratio (TCR) across models and methods.
  • Figure 2: Response utilization ratio (RUR) over the course of training.
  • Figure 3: Ablation using the average Avg@32 based on Qwen2.5-Math-7B.
  • Figure 4: Clipping bound comparisons. Lines show bounds for fixed clipping ($\varepsilon=0.2$) vs. dynamic-adaptive clipping ($\varepsilon_{\text{low}}=0.16, \varepsilon_{\text{high}}=0.2$).
  • Figure 5: Avg@1 performance across benchmarks.
  • ...and 2 more figures
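
The fixed-clipping baseline in the Figure 4 caption ($\varepsilon=0.2$) can be made concrete with a little arithmetic. The sketch below covers only the fixed case, because the formula for the dynamic-adaptive bound is not stated in this section. It computes which new-policy probabilities keep the ratio $\pi_{\text{new}}/\pi_{\text{old}}$ inside $[1-\varepsilon, 1+\varepsilon]$. The function name and the example probabilities are illustrative assumptions.

```python
def fixed_clip_prob_range(old_prob, eps=0.2):
    """Range of new-policy probabilities whose ratio to old_prob stays
    inside the fixed clipping interval [1 - eps, 1 + eps]."""
    low = old_prob * (1.0 - eps)
    # A probability cannot exceed 1, whatever the ratio bound allows.
    high = min(1.0, old_prob * (1.0 + eps))
    return low, high


widths = {}
for old_prob in (0.9, 0.1, 0.01):
    low, high = fixed_clip_prob_range(old_prob)
    widths[old_prob] = high - low
    print(f"old prob {old_prob}: unclipped new prob in [{low:.4f}, {high:.4f}]")
```

Under a fixed ratio bound, the room for change in absolute probability shrinks in proportion to the old probability, so a token at 0.01 can move far less than a token at 0.9 before it is clipped. This is the limitation for low-probability tokens that motivates clipping bounds depending on the token's prior probability.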