Clip-Low Increases Entropy and Clip-High Decreases Entropy in Reinforcement Learning of Large Language Models

Jaesung R. Park, Junsu Kim, Gyeongman Kim, Jinyoung Jo, Sean Choi, Jaewoong Cho, Ernest K. Ryu

TL;DR

This work shows that clipping in PPO/GRPO biases policy entropy during RLVR for large language models: clip-low raises entropy while clip-high lowers it, and under standard settings the clip-high effect often dominates and causes entropy collapse. Through a toy analysis with random rewards and empirical RLVR experiments on math-reasoning tasks, the authors demonstrate that entropy can be controlled by tuning the asymmetric clipping parameters, enabling sustained exploration without sacrificing performance. The findings give a mechanistic account of entropy dynamics in RLVR and a practical clipping-based tool for preventing entropy collapse, which may improve long-horizon training stability and reasoning capabilities. Deliberate entropy management via clipping can maintain exploration, support continued learning, and improve pass@k metrics in LLM reasoning tasks.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has recently emerged as the leading approach for enhancing the reasoning capabilities of large language models (LLMs). However, RLVR is prone to entropy collapse, where the LLM quickly converges to a near-deterministic form, hindering exploration and progress during prolonged RL training. In this work, we reveal that the clipping mechanism in PPO and GRPO induces biases on entropy. Through theoretical and empirical analyses, we show that clip-low increases entropy, while clip-high decreases it. Further, under standard clipping parameters, the effect of clip-high dominates, resulting in an overall entropy reduction even when purely random rewards are provided to the RL algorithm. Our findings highlight an overlooked confounding factor in RLVR: independent of the reward signal, the clipping mechanism influences entropy, which in turn affects the reasoning behavior. Furthermore, our analysis demonstrates that clipping can be deliberately used to control entropy. Specifically, with a more aggressive clip-low value, one can increase entropy, promote exploration, and ultimately prevent entropy collapse in RLVR training.
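To make the two clipping events concrete, here is a minimal sketch of a per-token clipped surrogate with separate lower and upper bounds. It assumes the standard PPO-style objective with decoupled bounds; the function names, the argument names, and the default value 0.2 are illustrative placeholders and are not taken from the paper.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Per-token clipped objective with separate clip bounds (assumed form).

    ratio     : pi_theta(a|s) / pi_old(a|s) for the sampled token
    advantage : advantage estimate for that token
    """
    # Restrict the ratio to the interval [1 - eps_low, 1 + eps_high].
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic choice between the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


def active_clip(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Report which bound, if any, makes the objective flat in the ratio."""
    if advantage > 0 and ratio > 1.0 + eps_high:
        return "clip-high"  # positive-advantage token already pushed up enough
    if advantage < 0 and ratio < 1.0 - eps_low:
        return "clip-low"   # negative-advantage token already pushed down enough
    return "none"


# Hypothetical tokens: (ratio, advantage)
examples = [(1.5, 1.0), (0.5, -1.0), (0.5, 1.0), (1.5, -1.0)]
labels = [active_clip(r, a) for r, a in examples]
```

In this sketch a token stops contributing gradient only in two cases: a positive advantage with the ratio above `1 + eps_high`, or a negative advantage with the ratio below `1 - eps_low`. Changing `eps_low` and `eps_high` independently changes how often each case occurs, which is the knob the abstract refers to.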

Paper Structure

This paper contains 31 sections, 2 theorems, 46 equations, 10 figures, 1 table.

Key Result

Theorem 1

Consider the setup described in Section ss:random-reward-formulation and the policy gradient algorithm given by Equation eq:PG. Then, the change in entropy at state $s$ admits the first-order approximation where $Q = \pi_k(a\,|\,s) (\log \pi_k(a\,|\,s) + \mathcal{H}(\theta^k\,|\,s))$, $p_k = \mathbb{P}(X_k)$, $q_k = \mathbb{P}(Y_k)$, $d^{\pi_{old}}$ is the state visitation measure, and the expect
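The quantity $Q$ defined above can be explored numerically. The sketch below assumes a single state with a tabular softmax policy over four actions, which may not match the paper's setup; the logits and helper names are hypothetical. Under that assumption, $Q$ for action $a$ equals the negative partial derivative of the entropy with respect to the logit of $a$, and the values of $Q$ sum to zero over actions. The code checks both statements by finite differences.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def q_values(probs):
    """Q(a) = pi(a) * (log pi(a) + H) for every action a."""
    h = entropy(probs)
    return [p * (math.log(p) + h) for p in probs]


logits = [2.0, 0.5, -1.0, 0.0]  # hypothetical logits for one state
probs = softmax(logits)
q = q_values(probs)

# Central finite differences of the entropy with respect to each logit.
step = 1e-6
grad = []
for i in range(len(logits)):
    up = list(logits)
    down = list(logits)
    up[i] += step
    down[i] -= step
    grad.append((entropy(softmax(up)) - entropy(softmax(down))) / (2 * step))
```

In this toy setting, $Q$ is positive exactly for actions whose log-probability exceeds the average log-probability $-\mathcal{H}$, so raising the logit of such an action lowers the entropy, while raising the logit of a low-probability action raises it.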

Figures (10)

  • Figure 1: Empirical estimates of $\mathbb{E}[Q]-\mathbb{E}[Q\,|\,X_k]$ and $\mathbb{E}[Q]-\mathbb{E}[Q\,|\,Y_k]$ throughout RL training with random rewards for (left) Qwen2.5-1.5B-Instruct and (right) Llama3.2-1B-Instruct. We observe that the values are always positive.
  • Figure 2: Estimated values of the quantity in Equation eq:log-sign throughout RL training with random rewards, averaged over 3 runs. (Left) Qwen2.5-1.5B-Instruct and (right) Llama3.2-1B-Instruct. We observe that the values are always positive.
  • Figure 3: Change in policy entropy during RL training of the Qwen2.5-1.5B-Instruct model with random rewards under different clipping settings. We observe that both clip-high and clip-low influence the entropy, consistent with our theoretical predictions.
  • Figure 4: (Left) Entropy change of different base models when trained with random rewards under symmetric clipping $\varepsilon_{\mathrm{low}} = \varepsilon_{\mathrm{high}}$. (Right) Entropy change of the Qwen2.5-1.5B-Instruct model with random rewards sampled from various probability distributions. Details of the experiments are provided in the random-reward ablation appendix.
  • Figure 5: Entropy change during (true reward) RLVR with GSM8K and Qwen2.5-3B-Instruct. (Left) Ablating the clipping mechanisms. (Right) Controlling entropy without clip-high. The clip-low value $\varepsilon_{\mathrm{low}}=0.15$ balances entropy, preventing entropy collapse and entropy explosion.
  • ...and 5 more figures

Theorems & Definitions (4)

  • Theorem 1
  • Theorem 2
  • proof
  • proof