Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs

Junbo Li, Peng Zhou, Rui Meng, Meet P. Vadera, Lihong Li, Yang Li

TL;DR

This work identifies fundamental instability in GRPO when applied to multi-turn agentic LLMs and demonstrates that PPO with a learnable critic offers more stable learning. By recasting the problem with a turn-level MDP, turn-PPO aligns credit assignment to whole turns, improving both stability and performance on long-horizon tasks like WebShop and Sokoban. The paper provides thorough ablations and training guidelines, showing that careful tuning of learning rates, batch diversity, and GAE parameters yields reliable improvements. Collectively, turn-PPO advances practical reinforcement learning for multi-turn LLMs, enabling more reliable long-horizon reasoning and tool use in interactive environments.

Abstract

Reinforcement learning (RL) has re-emerged as a natural approach for training interactive LLM agents in real-world environments. However, directly applying the widely used Group Relative Policy Optimization (GRPO) algorithm to multi-turn tasks exposes notable limitations, particularly in scenarios requiring long-horizon reasoning. To address these challenges, we investigate more stable and effective advantage estimation strategies, especially for multi-turn settings. We first explore Proximal Policy Optimization (PPO) as an alternative and find it to be more robust than GRPO. To further enhance PPO in multi-turn scenarios, we introduce turn-PPO, a variant that operates on a turn-level MDP formulation, as opposed to the commonly used token-level MDP. Our results on the WebShop and Sokoban datasets demonstrate the effectiveness of turn-PPO, both with and without long reasoning components.

Paper Structure

This paper contains 33 sections, 3 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Comparison of advantage computation in GRPO, token-PPO, and turn-PPO. In turn-PPO, the state is defined as $s_n := \left(\oplus_{n'<n}(Q_{n'}, R_{n'})\right) \oplus Q_n$ and the action as $a_n := R_n$. For the critic in token-PPO and turn-PPO, the position of $\hat{V}_h$ in the figure indicates that it is conditioned on all tokens up to that point.
  • Figure 2: The first two plots show GRPO validation reward curves during training on WebShop and Sokoban for Qwen2.5 and Qwen3. The third plot shows rewards for GRPO and its variants with respect to standard deviation, KL, and batch size diversity. The last plot shows the evolution of the standard deviation throughout training.
  • Figure 3: Comparison of turn-PPO and token-PPO in mean reward across multiple settings and clipping ratio for Sokoban. Turn-PPO shows superior performance in most settings, highlighting the benefit of turn-level advantage estimation.
  • Figure 4: Ablation studies on (left) number of diverse samples in a batch, (middle) discount factor $\gamma$, and (right) bias–variance trade-off parameter $\lambda$, showing their impact on mean reward using WebShop and Qwen3 with reasoning.
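
The turn-level formulation in Figure 1 and the $\gamma$ and $\lambda$ parameters in Figure 4 can be made concrete with a small sketch. The Python below is illustrative only and is not the authors' implementation. It assumes that $\oplus$ can be modeled as plain string concatenation, that one scalar reward and one critic value are available per turn, and that the value after the final turn is zero. The function names `build_turn_states` and `turn_level_gae` and the default values of `gamma` and `lam` are hypothetical.

```python
from typing import List, Sequence, Tuple


def build_turn_states(
    queries: Sequence[str], responses: Sequence[str]
) -> Tuple[List[str], List[str]]:
    """Build turn-level (state, action) pairs.

    Follows s_n := (concat of (Q_n', R_n') for n' < n) + Q_n and a_n := R_n,
    with string concatenation standing in for the paper's concatenation operator
    (an assumption made for this sketch).
    """
    states: List[str] = []
    actions: List[str] = []
    history = ""
    for query, response in zip(queries, responses):
        states.append(history + query)   # everything seen so far, plus the new query
        actions.append(response)         # the whole response is a single action
        history += query + response
    return states, actions


def turn_level_gae(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 1.0,   # discount factor (default is an assumption)
    lam: float = 1.0,     # bias-variance trade-off (default is an assumption)
) -> List[float]:
    """Standard GAE recursion applied with one step per turn, not per token."""
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have one entry per turn")
    advantages = [0.0] * len(rewards)
    next_value = 0.0   # assumed value after the final turn
    running = 0.0
    for n in reversed(range(len(rewards))):
        delta = rewards[n] + gamma * next_value - values[n]
        running = delta + gamma * lam * running
        advantages[n] = running
        next_value = values[n]
    return advantages


if __name__ == "__main__":
    states, actions = build_turn_states(["Q1", "Q2"], ["R1", "R2"])
    print(states, actions)
    print(turn_level_gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5]))
```

With `lam=1.0` and `gamma=1.0` each turn's advantage reduces to the undiscounted return minus the critic's value for that turn, while `lam=0.0` keeps only the one-step temporal-difference error. In a turn-level setup one would typically assign a turn's advantage to every token of the corresponding response, though how the paper handles this detail is not stated in this summary.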