EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning

Wujiang Xu, Wentian Zhao, Zhenting Wang, Yu-Jhe Li, Can Jin, Mingyu Jin, Kai Mei, Kun Wan, Dimitris N. Metaxas

TL;DR

This work tackles the challenge of training large language model (LLM) agents in multi-turn, sparse-reward environments, where conventional reinforcement learning often suffers an exploration-exploitation cascade failure. The authors introduce Entropy-regularized Policy Optimization (EPO), a general on-policy framework that combines three components: trajectory-aware entropy regularization, an entropy smoothing regularizer anchored to historical entropy, and adaptive phase-based weighting that stabilizes exploration across training phases. Empirical results show gains of up to 152% on ScienceWorld and up to 19.8% on ALFWorld, together with markedly improved training stability, so that tasks on which the baselines failed to converge become convergent optimization problems. The work argues that effective entropy control in multi-turn settings requires temporal awareness and history-based regulation, with broad implications for training LLM agents in long-horizon, sparse-reward domains.

Abstract

Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. This cascade begins with premature policy convergence in the early stage, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Subsequently, agents enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis shows that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.
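
The three mechanisms named in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: this page does not give the exact objective, so every function name, the band factors `low` and `high`, the linear ramp in `phase_weight`, and the way the terms are combined in `regularized_loss` are assumptions chosen only to make the idea concrete.

```python
import math
from typing import List, Sequence


def step_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def trajectory_entropy(turn_dists: Sequence[Sequence[float]]) -> float:
    """Mechanism (1): average the entropy over every turn of a trajectory."""
    return sum(step_entropy(d) for d in turn_dists) / len(turn_dists)


def smoothing_penalty(entropy: float, history: List[float],
                      low: float = 0.5, high: float = 1.5) -> float:
    """Mechanism (2), assumed form: no penalty while the current entropy lies
    inside [low * mean, high * mean] of past entropies, linear penalty outside."""
    if not history:
        return 0.0
    mean = sum(history) / len(history)
    lower, upper = low * mean, high * mean
    if entropy < lower:
        return lower - entropy
    if entropy > upper:
        return entropy - upper
    return 0.0


def phase_weight(step: int, total_steps: int,
                 beta_start: float = 0.1, beta_end: float = 1.0) -> float:
    """Mechanism (3), assumed schedule: linear ramp of the smoothing weight
    beta_k from beta_start to beta_end over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return beta_start + (beta_end - beta_start) * frac


def regularized_loss(policy_loss: float, entropy: float, history: List[float],
                     step: int, total_steps: int, ent_coef: float = 0.01) -> float:
    """Hypothetical combination: reward entropy, penalize leaving the band."""
    beta_k = phase_weight(step, total_steps)
    penalty = smoothing_penalty(entropy, history)
    return policy_loss - ent_coef * entropy + ent_coef * beta_k * penalty


# Usage: a two-turn trajectory with one uniform and one peaked distribution.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
h = trajectory_entropy([uniform, peaked])
loss = regularized_loss(policy_loss=1.0, entropy=h, history=[1.0, 1.1],
                        step=10, total_steps=100)
```

In this sketch the history is a plain list of past trajectory-level entropies; a real training loop would append the batch-mean entropy after each update and compute the terms on tensors so that gradients flow through the entropy.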

Paper Structure

This paper contains 39 sections, 5 theorems, 49 equations, 8 figures, 3 tables, and 2 algorithms.

Key Result

Lemma 1

For any two policies $\pi$ and $\pi'$, and any state $s_0$:
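
The statement is cut off at this point in the listing. For orientation only, the standard discounted form of the performance difference lemma is reproduced below. The paper's Lemma 1 may use different notation or a finite-horizon variant, so the discount factor $\gamma$, the normalized discounted state-visitation distribution $d^{\pi'}_{s_0}$, and the advantage function $A^{\pi}$ should be read as assumptions rather than as the paper's symbols.

$$
V^{\pi'}(s_0) - V^{\pi}(s_0) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi'}_{s_0}}\, \mathbb{E}_{a \sim \pi'(\cdot \mid s)} \left[ A^{\pi}(s, a) \right]
$$

Read this way, the gap between two policies is governed by how much advantage the new policy $\pi'$ collects, measured under the old policy's advantage function, along the states that $\pi'$ itself visits.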

Figures (8)

  • Figure 1: The exploration-exploitation cascade failure in multi-turn agent training. The cascade failure manifests in two distinct phases clearly visible in the figure: (1) Phase 1 - Excessive Early Exploration (0-40 steps): PPO's early trajectory steps (pink dashed line) exhibit rapid, uncontrolled entropy growth that creates unstable behavioral foundations, while rewards remain stagnant, indicating ineffective exploration-to-reward conversion. (2) Phase 2 - Uncertainty Propagation (40-120 steps): The instability from early steps cascades to late trajectory steps (red dotted line), maintaining dangerously high entropy oscillations that prevent coherent strategy formation and result in reward plateaus despite continuous exploration. This two-phase pattern demonstrates why standard entropy methods fail in multi-turn environments. In contrast, our EPO method maintains stable, controlled entropy levels across both early and late trajectory steps throughout training, achieving significantly lower final entropy values and consistent reward improvement, preventing the cascade failure.
  • Figure 2: Training dynamics and generalization performance analysis. We present the evolution of training rewards and validation success rates across both in-distribution (IID) and out-of-distribution (OOD) evaluation settings. (a-c) ScienceWorld experimental results contrasting PPO and PPO+EPO performance across training reward accumulation, IID validation, and OOD validation metrics. (d-f) ALFWorld experimental results contrasting GRPO and GRPO+EPO under identical evaluation criteria. Our EPO enhancement exhibits significantly improved training stability and substantial performance gains across both IID and OOD evaluation scenarios against baseline methods.
  • Figure 3: Ablation studies on entropy regularization components. (a-c) ScienceWorld comparison of EPO versus EPO-Base without entropy smoothing, demonstrating that smoothing is essential for stable convergence in sparse reward settings. (d-f) ALFWorld comparison of EPO with dynamic $\beta_k$ versus EPO W/O DW using constant $\beta$, showing that adaptive weighting significantly accelerates early training progress.
  • Figure 4: Model studies on ScienceWorld. (a) Consistent entropy regularization outperforms decaying schedules, preventing cascade failure. (b) EPO achieves near-perfect success while EA plateaus at 0.5--0.6 due to reasoning degradation. (c) Decay schedules prematurely suppress crucial early-turn exploration, triggering late-stage uncertainty propagation.
  • Figure 5: Training dynamics and generalization performance analysis. We present the evolution of training rewards and validation success rates across both in-distribution (IID) and out-of-distribution (OOD) evaluation settings. (a,c,e) ScienceWorld experimental results contrasting GRPO and GRPO+EPO performance across training reward accumulation, IID validation, and OOD validation metrics. (b,d,f) ALFWorld experimental results contrasting PPO and PPO+EPO under identical evaluation criteria. Our EPO enhancement exhibits significantly improved training stability and substantial performance gains across both IID and OOD evaluation scenarios against baseline methods.
  • ...and 3 more figures

Theorems & Definitions (8)

  • Lemma 1: Performance Difference
  • Lemma 2: Entropy Gradient
  • Lemma 3: Entropy Bias
  • Lemma 4: Performance Bound under Gradient Norm
  • Definition 5: EPO Objective
  • Proposition 6: Improved Performance Bound with EPO
  • proof
  • proof