Flexible Entropy Control in RLVR with Gradient-Preserving Perspective

Kun Chen, Peng Shi, Fanfan Liu, Haibo Qiu, Zhixiong Zeng, Siqi Yang, Wenji Mao

TL;DR

This work tackles entropy collapse during RLVR training of LLMs by reframing entropy control through Gradient-Preserving Clipping. It develops a principled regulation mechanism with dynamic upper and lower clipping thresholds and introduces three entropy-control strategies (Increase-Then-Decrease, Decrease-Increase-Decrease, and Oscillatory Decay) to flexibly shape entropy over training. The authors present a theoretical justification based on the inner product between gradient signals and four importance-sampling ratio regions, along with empirical validation on Qwen models trained with DAPO-MATH, showing reduced entropy collapse and improved performance on multiple math benchmarks. Collectively, the framework offers a scalable, principled approach to stabilizing RLVR training and enhancing the reasoning capabilities of LLMs.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a critical method for enhancing the reasoning capabilities of Large Language Models (LLMs). However, continuous training often leads to policy entropy collapse, characterized by a rapid decay in entropy that results in premature overconfidence, reduced output diversity, and vanishing gradient norms that inhibit learning. Gradient-Preserving Clipping is a primary factor influencing these dynamics, but existing mitigation strategies are largely static and lack a framework connecting clipping mechanisms to precise entropy control. This paper proposes reshaping entropy control in RL from the perspective of Gradient-Preserving Clipping. We first theoretically and empirically verify the contributions of specific importance sampling ratio regions to entropy growth and reduction. Leveraging these findings, we introduce a novel regulation mechanism that uses dynamic clipping thresholds to precisely manage entropy. Furthermore, we design and evaluate dynamic entropy control strategies, including increase-then-decrease, decrease-increase-decrease, and oscillatory decay. Experimental results demonstrate that these strategies effectively mitigate entropy collapse and achieve superior performance across multiple benchmarks.
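
To make the two quantities in the abstract concrete, the sketch below computes the entropy of a next-token distribution and evaluates the standard PPO-style clipped surrogate with separate lower and upper thresholds. This is a minimal illustration, not the paper's regulation mechanism: the function names `token_entropy` and `clipped_surrogate` and the parameters `eps_low` and `eps_high` are assumptions chosen for this example, and the paper's threshold schedules are not reproduced here.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def clipped_surrogate(ratio, advantage, eps_low, eps_high):
    """PPO-style clipped objective for one token (illustrative only).

    ratio is pi_theta(token) / pi_old(token). Returns (objective, has_gradient),
    where has_gradient is False when the clipped branch is selected, meaning the
    token contributes no gradient to the policy update.
    """
    # Clamp the importance sampling ratio to [1 - eps_low, 1 + eps_high].
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    unclipped_term = ratio * advantage
    clipped_term = clipped * advantage
    objective = min(unclipped_term, clipped_term)
    # The ratio (and hence the policy) only receives gradient when the
    # unclipped term is the one selected by the min.
    has_gradient = unclipped_term <= clipped_term
    return objective, has_gradient


# A peaked distribution has lower entropy than a uniform one.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
print(token_entropy(uniform), token_entropy(peaked))

# Widening the upper threshold lets a token with ratio 1.5 keep its gradient.
print(clipped_surrogate(1.5, 1.0, eps_low=0.2, eps_high=0.2))
print(clipped_surrogate(1.5, 1.0, eps_low=0.2, eps_high=0.6))
```

Under these assumptions, changing `eps_low` or `eps_high` over training changes which tokens contribute gradient, which is the lever a dynamic clipping threshold would act on.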

Paper Structure

This paper contains 35 sections, 34 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Training dynamics of entropy and gradient norm during GRPO optimization, illustrating entropy collapse and providing empirical evidence for the theoretical bound on the gradient norm.
  • Figure 2: (a) Visualization of PPO clipping threshold regions and probability ratios; (b) Visualization of four entropy-sensitive regions (E1–E4), categorized by the relationship between the old probability ($\pi_{old}$) and the current probability ($\pi_{\theta}$). These regions distinguish between high ($>0.7$) and low ($\le 0.3$) probability states, as well as probability gains and drops; (c) Entropy dynamics curves showing how regions E1/E4 reduce entropy while E2/E3 increase it.
  • Figure 3: Schematic diagram of (a) dynamic upper clipping threshold and (b) dynamic lower clipping threshold.
  • Figure 4: Increase-then-Decrease entropy control strategy.
  • Figure 5: Decrease-Increase-Decrease entropy control strategy.
  • ...and 8 more figures