A Unified Framework for Rethinking Policy Divergence Measures in GRPO

Qingyuan Wu, Yuhui Wang, Simon Sinong Zhan, Yanning Dai, Shilong Deng, Sarra Habchi, Qi Zhu, Matthias Gallé, Chao Huang

TL;DR

The paper addresses how sensitive policy updates in RLVR for LLMs are to the way policy divergence is constrained. It introduces a unified clipping framework that generalizes ratio-based and KL-based constraints, and identifies the KL3 estimator ${\rm KL3}_t(\theta) = w_t(\theta) - 1 - \log w_t(\theta)$ as an effective, low-variance divergence measure. Building on this, the authors propose ATR-GRPO, which implements approximate trust-region clipping via KL3-derived ranges $[l_\delta^{\rm KL3}, u_\delta^{\rm KL3}]$, enabling asymmetric, data-efficient exploration while preserving the efficiency of GRPO. Theoretical analyses connect KL3-based clipping to ratio-based updates and demonstrate an entropy-driven exploration mechanism; experiments on mathematical reasoning benchmarks with Qwen3-1.7B and Qwen3-8B show improved training stability and final performance over state-of-the-art baselines. Overall, the work underscores the critical role of principled policy-divergence constraints in scalable RL for LLMs and offers practical methods for enhancing exploration and stability.
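As a small illustration of why the KL3 expression can serve as a divergence measure, the sketch below checks on a toy categorical distribution that its expectation under the sampling policy matches the exact KL divergence. It assumes the usual GRPO convention that $w_t(\theta)$ is the ratio of the current policy's probability to the old (sampling) policy's probability; the function names and the toy distributions are hypothetical and not taken from the paper.

```python
import math


def kl3(w):
    """Per-sample KL3 value for a likelihood ratio w > 0: w - 1 - log(w)."""
    return w - 1.0 - math.log(w)


def exact_kl(p, q):
    """Exact KL(p || q) for two categorical distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def expected_kl3(p_old, p_new):
    """Expectation of KL3 over actions drawn from p_old, with w = p_new / p_old.

    Assumption: samples come from the old policy, as in GRPO-style training.
    """
    return sum(po * kl3(pn / po) for po, pn in zip(p_old, p_new))


# Toy distributions over three actions (illustrative values only).
p_old = [0.5, 0.3, 0.2]
p_new = [0.4, 0.4, 0.2]

gap = abs(expected_kl3(p_old, p_new) - exact_kl(p_old, p_new))
per_sample = [kl3(pn / po) for po, pn in zip(p_old, p_new)]
```

Under this convention the `w - 1` term has zero mean, so the expectation reduces to KL(old || new), while every per-sample value stays non-negative because $\log w \leq w - 1$.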

Abstract

Reinforcement Learning with Verified Reward (RLVR) has emerged as a critical paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). Most existing RLVR methods, such as GRPO and its variants, ensure stable updates by constraining policy divergence through clipping likelihood ratios. This paper introduces a unified clipping framework that characterizes existing methods via a general notion of policy divergence, encompassing both likelihood ratios and Kullback-Leibler (KL) divergences and extending to alternative measures. The framework provides a principled foundation for systematically analyzing how different policy divergence measures affect exploration and performance. We further identify the KL3 estimator, a variance-reduced Monte Carlo estimator of the KL divergence, as a key policy divergence constraint. We theoretically demonstrate that the KL3-based constraint is mathematically equivalent to an asymmetric ratio-based clipping that reallocates probability mass toward high-confidence actions, promoting stronger exploration while retaining the simplicity of GRPO-style methods. Empirical results on mathematical reasoning benchmarks demonstrate that incorporating the KL3 estimator into GRPO improves both training stability and final performance, highlighting the importance of principled policy divergence constraints in policy optimization.
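To make the link between a KL3 constraint and asymmetric ratio clipping concrete, the following sketch solves $w - 1 - \log w = \delta$ numerically for its two roots, which bound the set of ratios satisfying ${\rm KL3} \leq \delta$. It is an illustration under stated assumptions, not the authors' implementation: the helper name `kl3_clip_range`, the bisection solver, and the value $\delta = 0.05$ are all hypothetical choices.

```python
import math


def kl3(w):
    """KL3 value for a likelihood ratio w > 0: w - 1 - log(w)."""
    return w - 1.0 - math.log(w)


def _bisect(lo, hi, delta, tol=1e-12):
    """Find w in [lo, hi] with kl3(w) = delta; endpoints must bracket the root."""
    f_lo = kl3(lo) - delta
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        f_mid = kl3(mid) - delta
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid  # root lies in the upper half
        else:
            hi = mid               # root lies in the lower half
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


def kl3_clip_range(delta):
    """Return (lower, upper) such that kl3(w) <= delta iff lower <= w <= upper."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    # Below 1: at w0 = exp(-(1 + delta)) we have kl3(w0) = w0 + delta > delta.
    lower = _bisect(math.exp(-(1.0 + delta)), 1.0, delta)
    # Above 1: grow the bracket until kl3 exceeds delta.
    hi = 2.0
    while kl3(hi) <= delta:
        hi *= 2.0
    upper = _bisect(1.0, hi, delta)
    return lower, upper


lower, upper = kl3_clip_range(0.05)
```

Because $w - 1 - \log w$ grows faster below $1$ than above it, the resulting interval extends further above $1$ than below it, which is the asymmetry the abstract refers to.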

Paper Structure

This paper contains 35 sections, 7 theorems, 41 equations, 4 figures, 4 tables.

Key Result

Theorem 4.1

Let the constraint be defined as ${\mathcal{C}}^{\rm ratio}_{t}(\theta) \coloneqq l_t \leq w_{t}(\theta) \leq u_t$. Then, for any parameter $\theta$ where the objective is differentiable, the gradient of the general objective is equivalent to that of the ratio-based objective: $\nabla$ …

Figures (4)

  • Figure 1: Illustration of the KL3-based constraint.
  • Figure 2: Comparison of Clip, Clip-Higher, and ATR-based clipping on Qwen3-1.7B. The training curves for (a) return, (b) entropy, and (c) completion length are smoothed with a 100-step moving average window. (d) Evaluation performance of Mean@8 (Average).
  • Figure 3: Ablation experiments of ATR-GRPO on Qwen3-1.7B. The performance for (a) Mean@8 and (b) Pass@8 with varying $\delta$.
  • Figure 4: The performance (Average) of ATR-GRPO on Qwen3-1.7B for (a) Mean@K and (b) Pass@K with varying K.

Theorems & Definitions (10)

  • Theorem 4.1: Gradient Equivalence
  • Theorem 4.2: Equivalence and Asymmetry
  • Theorem 5.1: Policy Logits Difference
  • Theorem 5.2: Entropy Difference
  • Theorem 1.1: Equivalence and Asymmetry
  • proof
  • Theorem 1.2: Policy Logits Difference
  • proof
  • Theorem 1.3: Entropy Difference
  • proof