Clip-Low Increases Entropy and Clip-High Decreases Entropy in Reinforcement Learning of Large Language Models
Jaesung R. Park, Junsu Kim, Gyeongman Kim, Jinyoung Jo, Sean Choi, Jaewoong Cho, Ernest K. Ryu
TL;DR
This work shows that clipping in PPO/GRPO biases policy entropy during RLVR for large language models: clip-low raises entropy while clip-high lowers it, and under standard settings the clip-high effect often dominates, causing entropy collapse. Through a toy analysis with random rewards and empirical RLVR experiments on math-reasoning tasks, the authors demonstrate that entropy can be controlled by tuning the asymmetric clipping parameters, sustaining exploration without sacrificing performance. The findings give a mechanistic account of entropy dynamics in RLVR and a practical clipping-based tool for preventing entropy collapse, potentially improving long-horizon training stability and reasoning capabilities. Overall, the paper shows that deliberate entropy management via clipping can maintain exploration, support continued learning, and improve pass@k metrics in LLM reasoning tasks.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has recently emerged as the leading approach for enhancing the reasoning capabilities of large language models (LLMs). However, RLVR is prone to entropy collapse, where the LLM quickly converges to a near-deterministic form, hindering exploration and progress during prolonged RL training. In this work, we reveal that the clipping mechanism in PPO and GRPO induces biases on entropy. Through theoretical and empirical analyses, we show that clip-low increases entropy, while clip-high decreases it. Further, under standard clipping parameters, the effect of clip-high dominates, resulting in an overall entropy reduction even when purely random rewards are provided to the RL algorithm. Our findings highlight an overlooked confounding factor in RLVR: independent of the reward signal, the clipping mechanism influences entropy, which in turn affects the reasoning behavior. Furthermore, our analysis demonstrates that clipping can be deliberately used to control entropy. Specifically, with a more aggressive clip-low value, one can increase entropy, promote exploration, and ultimately prevent entropy collapse in RLVR training.
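To make the terms concrete, the following is a minimal sketch of a per-token PPO-style clipped objective with separate lower and upper clip ranges, together with the Shannon entropy of a categorical policy. It is an illustration under stated assumptions, not the authors' implementation: the function names, the parameter names `eps_low` and `eps_high`, and the default value 0.2 are hypothetical choices made here, and the sketch assumes that "clip-low" refers to the bound 1 - eps_low (which binds only for negative advantages) and "clip-high" to the bound 1 + eps_high (which binds only for positive advantages).

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Per-token clipped objective: min(r * A, clip(r, 1 - eps_low, 1 + eps_high) * A).

    eps_low / eps_high are illustrative names and defaults (assumptions).
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)


def active_clip(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Report which bound, if any, flattens the objective for this token.

    When a bound is active, the objective no longer depends on the ratio,
    so the token contributes no gradient.
    """
    if advantage > 0 and ratio > 1.0 + eps_high:
        return "clip-high"
    if advantage < 0 and ratio < 1.0 - eps_low:
        return "clip-low"
    return None


if __name__ == "__main__":
    # Positive advantage, ratio above the upper bound: clip-high is active.
    print(active_clip(1.5, 1.0))   # clip-high
    # Negative advantage, ratio below the lower bound: clip-low is active.
    print(active_clip(0.5, -1.0))  # clip-low
    # Negative advantage with a large ratio is not clipped.
    print(active_clip(1.5, -1.0))  # None
    # A uniform policy has higher entropy than a near-deterministic one.
    print(entropy([0.25] * 4) > entropy([0.97, 0.01, 0.01, 0.01]))  # True
```

In this sketch, widening or narrowing `eps_low` and `eps_high` independently changes which tokens stop contributing gradient, which is the knob the abstract describes as a way to steer entropy.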
