Flexible Entropy Control in RLVR with Gradient-Preserving Perspective
Kun Chen, Peng Shi, Fanfan Liu, Haibo Qiu, Zhixiong Zeng, Siqi Yang, Wenji Mao
TL;DR
This work tackles entropy collapse during RLVR training of LLMs by reframing entropy control through Gradient-Preserving Clipping. It develops a principled regulation mechanism with dynamic upper and lower clipping thresholds and introduces three entropy-control strategies—Increase-Then-Decrease, Decrease-Increase-Decrease, and Oscillatory Decay—to flexibly shape entropy over training. The authors present a theoretical justification based on the inner product between gradient signals and four important-sampling ratio regions, along with empirical validation on Qwen models trained with DAPO-MATH, showing reduced entropy collapse and improved performance on multiple math benchmarks. Collectively, the framework offers a scalable, principled approach to stabilize RLVR training and enhance the reasoning capabilities of LLMs.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a critical method for enhancing the reasoning capabilities of Large Language Models (LLMs). However, continuous training often leads to policy entropy collapse, characterized by a rapid decay in entropy that results in premature overconfidence, reduced output diversity, and vanishing gradient norms that inhibit learning. Gradient-Preserving Clipping is a primary factor influencing these dynamics, but existing mitigation strategies are largely static and lack a framework connecting clipping mechanisms to precise entropy control. This paper proposes reshaping entropy control in RL from the perspective of Gradient-Preserving Clipping. We first theoretically and empirically verify the contributions of specific importance sampling ratio regions to entropy growth and reduction. Leveraging these findings, we introduce a novel regulation mechanism using dynamic clipping threshold to precisely manage entropy. Furthermore, we design and evaluate dynamic entropy control strategies, including increase-then-decrease, decrease-increase-decrease, and oscillatory decay. Experimental results demonstrate that these strategies effectively mitigate entropy collapse, and achieve superior performance across multiple benchmarks.
