EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning
Wujiang Xu, Wentian Zhao, Zhenting Wang, Yu-Jhe Li, Can Jin, Mingyu Jin, Kai Mei, Kun Wan, Dimitris N. Metaxas
TL;DR
This work tackles the challenge of training large language model (LLM) agents in multi-turn, sparse-reward environments, where conventional reinforcement learning often suffers an exploration-exploitation cascade failure. The authors introduce Entropy-regularized Policy Optimization (EPO), a general on-policy framework that combines three components: trajectory-aware entropy regularization, an entropy smoothing regularizer anchored to historical entropy, and adaptive phase-based weighting that stabilizes exploration across training phases. Empirical results show performance improvements of up to 152% on ScienceWorld and up to 19.8% on ALFWorld, along with markedly improved training stability, turning previously intractable tasks into convergent optimization problems. The work demonstrates that effective entropy control in multi-turn settings requires temporal awareness and history-based regulation, with broad implications for training LLM agents in long-horizon, sparse-reward domains.
Abstract
Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. The cascade begins with premature policy convergence in the early stage, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Agents then enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis shows that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.
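
To make the three mechanisms concrete, the following is a minimal Python sketch of how they might fit together in a single loss term. The abstract gives no formulas, so everything here is an assumption made for illustration rather than the authors' formulation: the function names, the band factors `low` and `high`, the linear schedule in `phase_weight`, the coefficient `ent_coef`, and the way the terms are combined are all hypothetical.

```python
import math
from typing import Sequence


def step_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one categorical action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def trajectory_entropy(step_dists: Sequence[Sequence[float]]) -> float:
    """Mean entropy over all action distributions in a multi-turn trajectory."""
    if not step_dists:
        raise ValueError("trajectory must contain at least one step")
    return sum(step_entropy(d) for d in step_dists) / len(step_dists)


def smoothing_penalty(entropy: float, history: Sequence[float],
                      low: float = 0.5, high: float = 1.5) -> float:
    """Distance of `entropy` from a band around the historical mean.

    The band [low * mean, high * mean] is an assumed form of "bounding
    entropy within historical averages"; the penalty is zero inside it.
    """
    if not history:
        return 0.0
    mean_h = sum(history) / len(history)
    lower, upper = low * mean_h, high * mean_h
    if entropy < lower:
        return lower - entropy
    if entropy > upper:
        return entropy - upper
    return 0.0


def phase_weight(step: int, total_steps: int,
                 start: float = 0.1, end: float = 1.0) -> float:
    """Hypothetical linear schedule: the smoothing weight grows over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def regularized_loss(policy_loss: float, entropy: float,
                     history: Sequence[float], step: int, total_steps: int,
                     ent_coef: float = 0.01) -> float:
    """Policy loss minus an entropy bonus plus a phase-weighted smoothing penalty."""
    beta = phase_weight(step, total_steps)
    penalty = smoothing_penalty(entropy, history)
    return policy_loss - ent_coef * entropy + ent_coef * beta * penalty
```

In this sketch the entropy bonus encourages exploration, the penalty activates only when the current trajectory entropy drifts outside the band set by past entropies, and the schedule increases the penalty's influence as training proceeds. A real implementation would compute these quantities on token-level logits inside an automatic-differentiation framework.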
