Arbitrary Entropy Policy Optimization Breaks The Exploration Bottleneck of Reinforcement Learning

Chen Wang, Zhaochun Li, Jionghao Bai, Yuzhi Zhang, Shisheng Cui, Zhou Zhao, Yue Wang

TL;DR

This work tackles entropy collapse in GRPO during RL fine-tuning of LLMs, which suppresses exploration and the discovery of new reasoning strategies. It introduces AEPO, a framework that regulates policy entropy not through a fixed entropy bonus but through a REINFORCE regularization term applied to temperature-adjusted samples, coupled with a temperature-based entropy target. The theoretical analysis shows that high-temperature REINFORCE raises entropy and low-temperature REINFORCE lowers it, which gives bidirectional control around a target entropy. Empirically, across mathematical reasoning benchmarks, AEPO outperforms GRPO and entropy-based baselines on pass@1 and pass@k, and even surpasses the base model on pass@1024. Ablations show that both temperature regulation and REINFORCE regularization are necessary. The paper also argues that entropy, exploration, and performance are linked non-monotonically, which has practical implications for designing RL signals that expand the reasoning frontier of LLMs.

Abstract

Reinforcement Learning (RL) is essential for enhancing the reasoning capabilities of large language models (LLMs), yet the widely adopted Group Relative Policy Optimization (GRPO) suffers from entropy collapse, causing exploration to vanish and policies to converge prematurely. As a result, RL is widely believed to be incapable of expanding the reasoning frontier of LLMs. Existing entropy-regularized methods introduce an inevitable trade-off between reward and entropy, leading to exploration accompanied by non-negligible optimization bias. In this work, we prove that temperature-guided REINFORCE can modulate policy entropy, and propose Arbitrary Entropy Policy Optimization (AEPO), which reformulates entropy regularization as a policy-gradient optimization problem. Rather than manipulating entropy directly, AEPO implicitly regulates it by applying a REINFORCE regularization term on temperature-adjusted samples, ensuring that entropy is controlled but never dominates optimization, thereby enabling arbitrary and principled entropy regulation. Experiments show that AEPO outperforms RL baselines on both pass@1 and pass@$k$, and even surpasses the base model on pass@1024. By modulating entropy precisely, AEPO achieves more effective optimization dynamics and provides direct empirical evidence that entropy, exploration, and performance are intrinsically linked.
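For readers unfamiliar with the pass@$k$ metric used above, the following is a minimal sketch of the commonly used unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ samples are drawn per problem and $c$ of them are correct. Whether the paper computes pass@$k$ exactly this way is an assumption; the function name `pass_at_k` and its arguments are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples drawn for a problem (assumed n >= k)
    c: number of those samples that are correct
    k: budget of attempts being evaluated
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples are wrong, every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains only wrong samples.
    all_wrong = comb(n - c, k) / comb(n, k)
    return 1.0 - all_wrong


# Illustrative usage with made-up counts (not results from the paper).
example = pass_at_k(n=4, c=1, k=2)
```

With these made-up counts, one correct sample out of four and a budget of two attempts, the estimate is 0.5. Averaging the per-problem estimates over a benchmark gives the reported pass@$k$ under this convention.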

Paper Structure

This paper contains 17 sections, 4 theorems, 45 equations, 4 figures, 4 tables.

Key Result

Lemma 4.1

Higher temperature distributions globally correspond to higher policy entropy, while lower temperature corresponds to lower entropy.
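The relationship stated in Lemma 4.1 can be checked numerically on a toy distribution. The sketch below is not taken from the paper: it assumes the usual temperature-scaled softmax over a hypothetical logit vector and computes the Shannon entropy at several temperatures. The logits and temperature values are arbitrary illustrations.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical next-token logits over a four-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
temperatures = [0.5, 1.0, 2.0]
entropies = [entropy(softmax_with_temperature(logits, t)) for t in temperatures]

for t, h in zip(temperatures, entropies):
    print(f"T={t}: entropy={h:.4f} nats")
```

On this example the entropy increases with temperature and stays below the uniform-distribution bound of $\log 4$ nats, consistent with the lemma. The lemma concerns the sampling distribution only; how sampling at a given temperature shifts the trained policy's entropy is the subject of the paper's later results.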

Figures (4)

  • Figure 1: Entropy across five runs of AEPO. By adjusting only the parameter $\mathcal{H}$, entropy can be controlled at different levels.
  • Figure 2: Entropy dynamics under temperature-controlled REINFORCE. High-temperature REINFORCE increases entropy, promoting exploration, while low-temperature REINFORCE reduces entropy.
  • Figure 3: Comparison between entropy regularization and AEPO. Entropy regularization often drives optimization toward two extremes—collapse or explosion—while AEPO maintains entropy within a stable and optimal exploration range.
  • Figure 4: Entropy trajectories of AEPO compared with GRPO, Entropy-Reg, and Entropy-Adv. AEPO stabilizes entropy around a moderate level, demonstrating controllable and robust entropy regulation throughout training.

Theorems & Definitions (8)

  • Lemma 4.1
  • Theorem 4.3
  • Corollary 4.4
  • Definition 2.1
  • Definition 2.2
  • Definition 2.3
  • Lemma 2.5
  • proof