Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective

Zhezheng Hao, Hong Wang, Haoyang Liu, Jian Luo, Jiarui Yu, Hande Dong, Qiang Lin, Can Wang, Jiawei Chen

TL;DR

This work addresses entropy collapse in Reinforcement Learning with Verifiable Rewards (RLVR) for large language models by reframing entropy control as token-level entropy-change dynamics. It derives a first-order estimator $\Omega_{i,t}$ linking per-token updates to entropy changes and uses a quadrant-based analysis to interpret existing interventions such as ratio clipping and sample weighting. Motivated by these insights, it proposes STEER, an adaptive token-level reweighting scheme $\lambda_{i,t} = \exp(-k |\Omega_{i,t}|)$ that keeps per-step entropy change within a moderate band, thereby mitigating collapse while preserving learning. Empirical results on mathematical reasoning benchmarks show STEER achieving stronger downstream performance, improved training stability, and robust behavior in extreme settings, highlighting a practical path to more reliable RLVR training of LLMs.
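The reweighting rule $\lambda_{i,t} = \exp(-k |\Omega_{i,t}|)$ can be illustrated with a minimal sketch. It assumes the per-token estimates $\Omega_{i,t}$ are already available as plain numbers; how the paper computes them is not shown here. The function names, the default value of `k`, and the way the weights multiply a per-token loss are all hypothetical choices for illustration, not the authors' implementation.

```python
import math


def steer_weights(omega, k=1.0):
    """Map per-token entropy-change estimates to weights exp(-k * |omega|).

    `omega` is a list of floats standing in for Omega_{i,t}; `k` is a
    non-negative decay rate (its value here is an assumption).
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    return [math.exp(-k * abs(o)) for o in omega]


def reweighted_token_losses(token_losses, omega, k=1.0):
    """Scale each per-token loss by its weight (an assumed way to apply them)."""
    if len(token_losses) != len(omega):
        raise ValueError("token_losses and omega must have the same length")
    weights = steer_weights(omega, k)
    return [w * loss for w, loss in zip(weights, token_losses)]


if __name__ == "__main__":
    # Tokens with a larger estimated entropy change, in either direction,
    # receive a smaller weight; a token with zero estimated change keeps weight 1.
    print(steer_weights([0.0, 0.1, -0.1, 2.0], k=1.0))
```

Because the weight depends only on $|\Omega_{i,t}|$, tokens predicted to raise entropy sharply and tokens predicted to lower it sharply are damped equally, which matches the stated goal of keeping the per-step entropy change within a moderate band.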

Abstract

While Reinforcement Learning with Verifiable Rewards (RLVR) can enhance LLM reasoning, its training process poses a critical risk: entropy collapse. This phenomenon is a rapid loss of policy diversity, stemming from the exploration-exploitation imbalance and leading to a lack of generalization. Recent entropy-intervention methods aim to prevent entropy collapse, yet their underlying mechanisms remain unclear. In this paper, we conduct a quantitative analysis to reveal token-level entropy changes and how existing entropy intervention methods help avoid entropy collapse. Our findings point out a fundamental limitation of existing methods: they attempt to control entropy dynamics indirectly. By only affecting related factors, such as the advantage signal and generation probability, their effectiveness is inherently limited and could potentially fail. To address this limitation, we introduce an entropy-change-aware reweighting scheme, namely Stabilizing Token-level Entropy-changE via Reweighting (STEER), that adaptively stabilizes entropy dynamics through fine-grained token-level adjustments. Our approach mitigates over-exploitation while fostering robust exploration. Extensive experiments demonstrate that STEER significantly mitigates entropy collapse, stabilizes entropy dynamics, and achieves stronger downstream performance across various mathematical reasoning benchmarks. Our code is available at https://github.com/zz-haooo/STEER.

Paper Structure

This paper contains 27 sections, 2 theorems, 18 equations, 23 figures, 6 tables.

Key Result

Theorem 1

(First-order entropy change) Let the policy model $\pi_\theta$ satisfy the parameter independence assumption. The change of conditional entropy between two update steps is defined as $\Delta \mathcal{H}_{i,t} \triangleq \mathcal{H}(\pi_{\theta}^{k+1} \mid s_{i,t}) - \mathcal{H}(\pi_{\theta}^{k} \mid s_{i,t})$ [...] where $\eta$ is the learning rate, $w_{i,t}=\mathbb{I}_{\text{clip}}\, r_{i,t}\,A_{i,t}$ is per-tok
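The quantity this theorem approximates, the difference in conditional entropy at a state $s_{i,t}$ between two consecutive policies, can be computed exactly for a small vocabulary. The sketch below does that from two logit vectors and also evaluates the weight $w_{i,t}=\mathbb{I}_{\text{clip}}\, r_{i,t}\,A_{i,t}$. The indicator is assumed to follow the usual PPO-style rule (zero when the ratio has already left the clip range in the direction the advantage pushes it); that rule, the clip bounds, and all function names are assumptions for illustration and are not taken from the paper. The first-order estimator itself is not reproduced.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_change(logits_before, logits_after):
    """H(pi^{k+1} | s) - H(pi^k | s) for a single state s."""
    return entropy(softmax(logits_after)) - entropy(softmax(logits_before))


def per_token_weight(ratio, advantage, clip_low=0.8, clip_high=1.2):
    """w = I_clip * r * A, with an assumed PPO-style clip indicator."""
    clipped = (advantage > 0 and ratio > clip_high) or (
        advantage < 0 and ratio < clip_low
    )
    return 0.0 if clipped else ratio * advantage


if __name__ == "__main__":
    # Sharpening the distribution toward one token lowers entropy.
    print(entropy_change([0.0, 0.0, 0.0], [3.0, 0.0, 0.0]))
    print(per_token_weight(ratio=1.0, advantage=2.0))
```

Comparing such an exact difference against an estimate is the kind of check Figure 1 reports, where estimated and ground-truth entropy changes are plotted together.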

Figures (23)

  • Figure 1: Entropy change estimation in the first 10 training steps on Qwen2.5-7B and Qwen2.5-Math-7B. The curve compares estimated vs. ground-truth entropy change (left axis) and histograms show token counts per bin (right axis).
  • Figure 2: MSE, PCC and SRCC comparison.
  • Figure 3: Token-level entropy change indicator $\delta(a|s)$.
  • Figure 4: Entropy change with advantage and probability.
  • Figure 5: Key Considerations in Current Approaches.
  • ...and 18 more figures

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 1
  • proof