Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs
Junbo Li, Peng Zhou, Rui Meng, Meet P. Vadera, Lihong Li, Yang Li
TL;DR
This work identifies instability in GRPO when it is applied to multi-turn agentic LLMs and shows that PPO with a learnable critic trains more stably. By recasting the problem as a turn-level MDP, turn-PPO assigns credit to whole turns rather than individual tokens, improving both stability and performance on long-horizon tasks such as WebShop and Sokoban. The paper provides ablations and training guidelines, showing that careful tuning of learning rates, batch diversity, and GAE parameters yields reliable improvements. Together, these results make reinforcement learning more practical for multi-turn LLM agents that perform long-horizon reasoning and tool use in interactive environments.
Abstract
Reinforcement learning (RL) has re-emerged as a natural approach for training interactive LLM agents in real-world environments. However, directly applying the widely used Group Relative Policy Optimization (GRPO) algorithm to multi-turn tasks exposes notable limitations, particularly in scenarios requiring long-horizon reasoning. To address these challenges, we investigate more stable and effective advantage estimation strategies for multi-turn settings. We first explore Proximal Policy Optimization (PPO) as an alternative and find it to be more robust than GRPO. To further enhance PPO in multi-turn scenarios, we introduce turn-PPO, a variant that operates on a turn-level MDP formulation rather than the commonly used token-level MDP. Our results on the WebShop and Sokoban environments demonstrate the effectiveness of turn-PPO, both with and without long reasoning components.
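To make the turn-level formulation concrete, the following is a minimal sketch of Generalized Advantage Estimation (GAE) computed over turns instead of tokens. It is an illustration under stated assumptions, not the authors' implementation: the function names (`gae`, `turn_level_advantages`), the choice of one critic value per turn, and the step of broadcasting each turn's advantage to all tokens in that turn are hypothetical choices made here for clarity.

```python
from typing import List


def gae(rewards: List[float], values: List[float],
        gamma: float = 1.0, lam: float = 0.95) -> List[float]:
    """Standard GAE over a sequence of steps.

    rewards[t] is the reward received after step t, values[t] is the
    critic's estimate at step t. The value after the last step is taken
    to be 0 (episode terminates).
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0   # bootstrap value after the final step
    running = 0.0      # discounted sum of TD residuals
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def turn_level_advantages(turn_rewards: List[float],
                          turn_values: List[float],
                          tokens_per_turn: List[int],
                          gamma: float = 1.0,
                          lam: float = 0.95) -> List[List[float]]:
    """Assumed turn-level variant: each turn is one MDP step.

    GAE runs over turns, and every token in a turn shares that turn's
    advantage (an assumption of this sketch).
    """
    turn_adv = gae(turn_rewards, turn_values, gamma, lam)
    return [[a] * n for a, n in zip(turn_adv, tokens_per_turn)]


if __name__ == "__main__":
    # Two-turn episode with a sparse reward of 1.0 at the end.
    per_token = turn_level_advantages(
        turn_rewards=[0.0, 1.0],
        turn_values=[0.5, 0.5],
        tokens_per_turn=[3, 2],
        gamma=1.0,
        lam=1.0,
    )
    print(per_token)
```

In a token-level MDP the same `gae` routine would instead run over every generated token, so the sequence length grows with response length; in the turn-level view sketched here it grows only with the number of interaction turns.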
