Arbitrary Entropy Policy Optimization Breaks The Exploration Bottleneck of Reinforcement Learning
Chen Wang, Zhaochun Li, Jionghao Bai, Yuzhi Zhang, Shisheng Cui, Zhou Zhao, Yue Wang
TL;DR
This work tackles entropy collapse in GRPO during RL fine-tuning of LLMs, which suppresses exploration and the discovery of new reasoning strategies. It introduces AEPO, a framework that regulates policy entropy not through a fixed entropy bonus but through a REINFORCE regularization term applied to temperature-adjusted samples, coupled with a temperature-based entropy target. Theoretical analysis shows that high-temperature REINFORCE can modulate entropy and that the approach yields bidirectional control around a target entropy. Empirically, across diverse mathematical reasoning benchmarks, AEPO outperforms GRPO and entropy-based baselines and even surpasses the base model on pass@k at large k (pass@1024). Ablations demonstrate that both temperature regulation and REINFORCE regularization are necessary. The paper also argues that entropy, exploration, and performance are linked non-monotonically, with practical implications for designing RL signals that expand the reasoning frontier of LLMs.
Abstract
Reinforcement Learning (RL) is essential for enhancing the reasoning capabilities of large language models (LLMs), yet the widely adopted Group Relative Policy Optimization (GRPO) suffers from entropy collapse, causing exploration to vanish and policies to converge prematurely. As a result, RL is widely believed to be incapable of expanding the reasoning frontier of LLMs. Existing entropy-regularized methods introduce an inevitable trade-off between reward and entropy, leading to exploration accompanied by non-negligible optimization bias. In this work, we prove that temperature-guided REINFORCE can modulate policy entropy, and propose Arbitrary Entropy Policy Optimization (AEPO), which reformulates entropy regularization as a policy-gradient optimization problem. Rather than manipulating entropy directly, AEPO implicitly regulates it by applying a REINFORCE regularization term on temperature-adjusted samples, ensuring that entropy is controlled but never dominates optimization, thereby enabling arbitrary and principled entropy regulation. Experiments show that AEPO outperforms RL baselines on both pass@1 and pass@$k$, and even surpasses the base model on pass@1024. By modulating entropy precisely, AEPO achieves more effective optimization dynamics and provides direct empirical evidence that entropy, exploration, and performance are intrinsically linked.
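As a toy illustration of why temperature-adjusted sampling is a natural handle on entropy, the sketch below computes the Shannon entropy of a temperature-scaled softmax distribution. It is not AEPO itself: it omits the REINFORCE regularization term and the GRPO objective entirely, and the logits and the helper names `softmax` and `entropy` are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical next-token logits, not taken from the paper.
logits = [2.0, 1.0, 0.5, -1.0]

# Entropy of the sampling distribution at several temperatures.
entropies = {t: entropy(softmax(logits, t)) for t in (0.5, 1.0, 2.0)}

for t, h in entropies.items():
    print(f"temperature={t}: entropy={h:.4f} nats")
```

Raising the temperature flattens the distribution and increases its entropy, bounded above by log(4) for four outcomes; lowering it sharpens the distribution and decreases entropy. Samples drawn at a temperature other than 1 therefore come from a distribution whose entropy is higher or lower than the policy's own, which is the property the temperature-adjusted samples in the abstract rely on.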
