Flexible Entropy Control in RLVR with Gradient-Preserving Perspective

Kun Chen; Peng Shi; Fanfan Liu; Haibo Qiu; Zhixiong Zeng; Siqi Yang; Wenji Mao

TL;DR

This work tackles entropy collapse during RLVR training of LLMs by reframing entropy control through Gradient-Preserving Clipping. It develops a regulation mechanism with dynamic upper and lower clipping thresholds and introduces three entropy-control strategies (Increase-Then-Decrease, Decrease-Increase-Decrease, and Oscillatory Decay) to shape entropy over the course of training. The authors present a theoretical justification based on the inner product between gradient signals and four importance sampling ratio regions, along with empirical validation on Qwen models trained with DAPO-MATH, showing reduced entropy collapse and improved performance on multiple math benchmarks. Together, the framework offers a scalable, principled approach to stabilizing RLVR training and enhancing the reasoning capabilities of LLMs.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a critical method for enhancing the reasoning capabilities of Large Language Models (LLMs). However, continuous training often leads to policy entropy collapse, characterized by a rapid decay in entropy that results in premature overconfidence, reduced output diversity, and vanishing gradient norms that inhibit learning. Gradient-Preserving Clipping is a primary factor influencing these dynamics, but existing mitigation strategies are largely static and lack a framework connecting clipping mechanisms to precise entropy control. This paper proposes reshaping entropy control in RL from the perspective of Gradient-Preserving Clipping. We first theoretically and empirically verify the contributions of specific importance sampling ratio regions to entropy growth and reduction. Leveraging these findings, we introduce a novel regulation mechanism that uses dynamic clipping thresholds to precisely manage entropy. Furthermore, we design and evaluate dynamic entropy control strategies, including increase-then-decrease, decrease-increase-decrease, and oscillatory decay. Experimental results demonstrate that these strategies effectively mitigate entropy collapse and achieve superior performance across multiple benchmarks.
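To make the quantities in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It computes per-token policy entropy and a standard PPO-style clipped surrogate with separate lower and upper thresholds. The function names and the linear schedule are hypothetical and chosen only for illustration; the paper's actual threshold rules and Gradient-Preserving Clipping variant are not reproduced here.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) along the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)  # shift for numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(log_p) * log_p).sum(axis=-1)

def clipped_surrogate(ratio, advantage, eps_low, eps_high):
    """PPO-style per-token objective with asymmetric clip thresholds.

    ratio is pi_theta / pi_old for the sampled token. This is plain
    clipping (an assumption for illustration), not the paper's mechanism.
    """
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return np.minimum(ratio * advantage, clipped * advantage)

def linear_threshold(step, total_steps, start, end):
    """Hypothetical schedule: move a clip threshold linearly from start to end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)
```

A uniform distribution over four tokens gives the maximum entropy log 4, while a sharply peaked one gives entropy near zero; entropy collapse corresponds to the policy drifting toward the latter. Varying `eps_high` or `eps_low` over training steps is one simple way to picture what a "dynamic clipping threshold" could look like.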

Paper Structure (35 sections, 34 equations, 13 figures, 3 tables)

Introduction
Preliminary
RL Algorithms of LLMs
The Policy Entropy of LLMs
Theoretical and Empirical Investigations
Methodology
Regulation Mechanism for Entropy Variations
Dynamic Upper Clipping Threshold
Dynamic Lower Clipping Threshold
Strategy Design for Entropy Control
Experiments
Experimental Setup
Experimental Results and Analysis
Analysis of Entropy and Performance
Analysis of Phase Ratios
...and 20 more sections

Figures (13)

Figure 1: Training dynamics of entropy and gradient norm during GRPO optimization, illustrating entropy collapse and providing empirical evidence for the theoretical bound on the gradient norm.
Figure 2: (a) Visualization of PPO clipping threshold regions and probability ratios; (b) Visualization of four entropy-sensitive regions (E1–E4), categorized by the relationship between the old probability ($\pi_{old}$) and the current probability ($\pi_{\theta}$). These regions distinguish between high ($>0.7$) and low ($\le 0.3$) probability states, as well as probability gains and drops; (c) Entropy dynamics curves showing how regions E1/E4 reduce entropy while E2/E3 increase it.
Figure 3: Schematic diagram of (a) the dynamic upper clipping threshold and (b) the dynamic lower clipping threshold.
Figure 4: Increase-then-Decrease entropy control strategy.
Figure 5: Decrease-Increase-Decrease entropy control strategy.
...and 8 more figures
