A Unified Framework for Rethinking Policy Divergence Measures in GRPO
Qingyuan Wu, Yuhui Wang, Simon Sinong Zhan, Yanning Dai, Shilong Deng, Sarra Habchi, Qi Zhu, Matthias Gallé, Chao Huang
TL;DR
The paper examines how sensitive policy updates in RLVR for LLMs are to the way policy divergence is constrained. It introduces a unified clipping framework that generalizes ratio-based and KL-based constraints, and identifies the KL3 estimator ${\rm KL3}_t(\theta) = w_t(\theta) - 1 - \log w_t(\theta)$ as an effective, low-variance divergence measure. Building on this, the authors propose ATR-GRPO, which implements approximate trust-region clipping via KL3-derived ranges $[l_\delta^{\rm KL3}, u_\delta^{\rm KL3}]$, enabling asymmetric, data-efficient exploration while preserving the efficiency of GRPO. Theoretical analyses connect KL3-based clipping to ratio-based updates and demonstrate an entropy-driven exploration mechanism; experiments on mathematical reasoning benchmarks with Qwen3-1.7B and Qwen3-8B show improved training stability and final performance over state-of-the-art baselines. Overall, the work underscores the critical role of principled policy-divergence constraints in scalable RL for LLMs and offers practical methods for enhancing exploration and stability.
Abstract
Reinforcement Learning with Verified Reward (RLVR) has emerged as a critical paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). Most existing RLVR methods, such as GRPO and its variants, ensure stable updates by constraining policy divergence through clipping likelihood ratios. This paper introduces a unified clipping framework that characterizes existing methods via a general notion of policy divergence, encompassing both likelihood ratios and Kullback-Leibler (KL) divergences and extending to alternative measures. The framework provides a principled foundation for systematically analyzing how different policy divergence measures affect exploration and performance. We further identify the KL3 estimator, a variance-reduced Monte Carlo estimator of the KL divergence, as a key policy divergence constraint. We theoretically demonstrate that the KL3-based constraint is mathematically equivalent to an asymmetric ratio-based clipping that reallocates probability mass toward high-confidence actions, promoting stronger exploration while retaining the simplicity of GRPO-style methods. Empirical results on mathematical reasoning benchmarks demonstrate that incorporating the KL3 estimator into GRPO improves both training stability and final performance, highlighting the importance of principled policy divergence constraints in policy optimization.
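To make the claimed equivalence between a KL3 constraint and asymmetric ratio clipping concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that $w$ is a likelihood ratio and that the clipping range is the set of ratios satisfying $w - 1 - \log w \le \delta$; neither the exact definition of $[l_\delta^{\rm KL3}, u_\delta^{\rm KL3}]$ nor the value of $\delta$ is given in this summary, so both are assumptions. The function names `kl3`, `kl3_clip_range`, and `_bisect` are hypothetical.

```python
import math
from typing import Callable, Tuple


def kl3(w: float) -> float:
    """KL3 estimator for a single ratio w > 0: w - 1 - log(w)."""
    return w - 1.0 - math.log(w)


def _bisect(f: Callable[[float], float], lo: float, hi: float, iters: int = 200) -> float:
    """Bisection root finder; assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid  # root lies in the upper half
        else:
            hi = mid               # root lies in the lower half
    return 0.5 * (lo + hi)


def kl3_clip_range(delta: float) -> Tuple[float, float]:
    """Return (lower, upper) ratios where kl3(w) == delta.

    Assumption (not stated in the source): the clipping range is the
    sublevel set {w : kl3(w) <= delta}. Since kl3 is convex with its
    minimum 0 at w = 1, that set is an interval around 1.
    """
    if delta <= 0:
        raise ValueError("delta must be positive")

    def g(w: float) -> float:
        return kl3(w) - delta

    # Lower root in (0, 1): at w = exp(-(delta + 1)), kl3(w) = w + delta > delta.
    lower = _bisect(g, math.exp(-(delta + 1.0)), 1.0)

    # Upper root in (1, inf): grow the bracket until kl3 exceeds delta.
    hi = 2.0
    while g(hi) <= 0:
        hi *= 2.0
    upper = _bisect(g, 1.0, hi)
    return lower, upper


if __name__ == "__main__":
    # delta = 0.02 is an illustrative value, not one taken from the paper.
    low, high = kl3_clip_range(0.02)
    print(f"lower={low:.4f}, upper={high:.4f}")
    print(f"room below 1: {1 - low:.4f}, room above 1: {high - 1:.4f}")
```

Under these assumptions the interval is wider above 1 than below it, because $w - 1 - \log w$ grows more slowly for $w > 1$ than for $w < 1$. This is the sense in which a single symmetric-looking threshold on KL3 translates into an asymmetric range on the ratio.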
