ST-PPO: Stabilized Off-Policy Proximal Policy Optimization for Multi-Turn Agents Training

Chenliang Li, Adel Elmahdy, Alex Boyd, Zhongruo Wang, Alfredo Garcia, Parminder Bhatia, Taha Kass-Hout, Cao Xiao, Mingyi Hong

TL;DR

The paper addresses the instability of PPO in multi-turn LLM agent training, which it attributes to a granularity mismatch between token-level optimization and turn-level interactions and to high-variance advantage estimates from a critic evaluated on off-policy samples. It introduces three variants: Turn-PPO, which uses turn-level importance sampling; S-PPO, which applies clipping-bias correction to token-level PPO; and ST-PPO, which combines turn-level importance sampling with clipping-bias correction. Experiments on multi-turn search tasks across general QA, multi-hop QA (e.g., NQ, HotpotQA), and medical multiple-choice QA benchmarks show that ST-PPO and S-PPO prevent training collapse, maintain lower clipping ratios, and outperform token-level PPO. The results point to a practical and scalable framework for stabilizing reinforcement learning with multi-turn LLM agents by aligning optimization with the task structure and mitigating off-policy instability.

Abstract

PPO has been widely adopted for training large language models (LLMs) at the token level in multi-turn dialogue and reasoning tasks. However, its performance is often unstable and prone to collapse. Through empirical analysis, we identify two main sources of instability in this setting: (1) token-level importance sampling, which is misaligned with the natural granularity of multi-turn environments that have distinct turn-level stages, and (2) inaccurate advantage estimates from off-policy samples, where the critic has not learned to evaluate certain state-action pairs, resulting in high-variance gradients and unstable updates. To address these challenges, we introduce two complementary stabilization techniques: (1) turn-level importance sampling, which aligns optimization with the natural structure of multi-turn reasoning, and (2) clipping-bias correction, which normalizes gradients by downweighting unreliable, highly off-policy samples. Depending on how these components are combined, we obtain three variants: Turn-PPO (turn-level sampling only), S-PPO (clipping-bias correction applied to token-level PPO), and ST-PPO (turn-level sampling combined with clipping-bias correction). In our experiments, we primarily study ST-PPO and S-PPO, which together demonstrate how the two stabilization mechanisms address complementary sources of instability. Experiments on multi-turn search tasks across general QA, multi-hop QA, and medical multiple-choice QA benchmarks show that ST-PPO and S-PPO consistently prevent the performance collapses observed in large-model training, maintain lower clipping ratios throughout optimization, and achieve higher task performance than standard token-level PPO. These results demonstrate that combining turn-level importance sampling with clipping-bias correction provides a practical and scalable solution for stabilizing multi-turn LLM agent training.
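To make the granularity difference concrete, the following is a minimal sketch contrasting per-token importance ratios with one ratio per turn. The abstract does not give the paper's exact turn-level formula, so the aggregation here is an assumption: the hypothetical `turn_ratios` helper pools the token log-ratios inside each turn and, by default, length-normalizes them (a geometric mean), with the plain product available as an alternative. The function names and the `turn_spans` layout are illustrative only.

```python
import math


def token_ratios(new_logps, old_logps):
    """Per-token importance ratios pi_new / pi_old from log-probabilities."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]


def turn_ratios(new_logps, old_logps, turn_spans, length_normalize=True):
    """One importance ratio per turn.

    turn_spans: list of (start, end) token indices, inclusive, one per turn.
    length_normalize=True takes the geometric mean of the token ratios in a
    turn (an assumption of this sketch); False takes their product.
    """
    ratios = []
    for start, end in turn_spans:
        # Sum the log-ratios of all tokens that belong to this turn.
        log_ratio = sum(new_logps[t] - old_logps[t] for t in range(start, end + 1))
        if length_normalize:
            log_ratio /= end - start + 1
        ratios.append(math.exp(log_ratio))
    return ratios


def clip(ratio, eps=0.2):
    """PPO-style clipping of a ratio to the interval [1 - eps, 1 + eps]."""
    return max(1.0 - eps, min(1.0 + eps, ratio))


# Two turns of two tokens each; the log-probabilities are made-up numbers.
new_logps = [-1.0, -0.5, -2.0, -0.1]
old_logps = [-1.2, -0.5, -1.5, -0.3]
spans = [(0, 1), (2, 3)]
per_token = [clip(r) for r in token_ratios(new_logps, old_logps)]
per_turn = [clip(r) for r in turn_ratios(new_logps, old_logps, spans)]
```

With token-level ratios, each token in a turn can be clipped independently; with a turn-level ratio, every token in the turn shares the same clipping decision, which is the alignment with turn structure that the abstract describes.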

Paper Structure

This paper contains 18 sections, 2 theorems, 23 equations, 6 figures, 2 tables, 1 algorithm.

Key Result

Lemma 4.1

The gradient of the objective function in Eq. \ref{eq:turn_level_PPO} is given by an expression (not reproduced in this listing) in which $|y^k| = t_k^{\text{end}} - t_k^{\text{start}} + 1$ is the number of tokens in turn $k$ and $\hat{A}^k \coloneqq \sum_{t=t_k^{\text{start}}}^{t_k^{\text{end}}} \mathds{1}_{\{(k,t) \in \mathcal{B}_{\text{turn}} \}} \ \hat{A}_{t}$ is the turn-level advantage.
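The two quantities defined in Lemma 4.1 can be computed directly from token-level data. The sketch below is illustrative only: the function names are hypothetical, and $\mathcal{B}_{\text{turn}}$ is treated as a given set of (turn, token) index pairs, since this listing does not define how that set is constructed.

```python
def turn_length(span):
    """|y^k| = t_end - t_start + 1 for an inclusive (t_start, t_end) span."""
    start, end = span
    return end - start + 1


def turn_advantage(k, span, token_advantages, b_turn):
    """A^k: sum of token advantages in turn k, restricted to pairs in b_turn.

    b_turn is a set of (k, t) pairs standing in for B_turn; how the paper
    builds that set is not stated here, so it is passed in as an input.
    """
    start, end = span
    return sum(
        token_advantages[t]
        for t in range(start, end + 1)
        if (k, t) in b_turn
    )


# Two turns covering tokens 0..2 and 3..4; the numbers are made up.
spans = [(0, 2), (3, 4)]
advantages = [0.5, -0.25, 1.0, 2.0, -1.0]
b_turn = {(0, 0), (0, 2), (1, 3), (1, 4)}

lengths = [turn_length(s) for s in spans]
turn_advs = [turn_advantage(k, s, advantages, b_turn) for k, s in enumerate(spans)]
```

Token 1 of turn 0 is excluded by the indicator, so its advantage does not contribute to that turn's sum.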

Figures (6)

  • Figure 1: Illustration of the four PPO variants. Token-level PPO becomes Turn-level PPO by applying turn-level importance sampling (Eq. \ref{eq:turn_level_IS}). Further adding the clipping bias to normalize gradients yields S-PPO and ST-PPO (Eq. \ref{eq:S_PPO} and Eq. \ref{eq:ST_PPO}). Both variants significantly reduce the probability of extreme gradient spikes, leading to more stable training.
  • Figure 2: Observations from a failed run of token-level PPO with the Qwen2.5-7B base model. From left to right, we show the estimated advantage, the ratio of valid actions (whether the tool is successfully called), the L2 norm of the policy gradient, and the success rate of the search task. Each metric is recorded for every training batch.
  • Figure 3: Comparison of token-level versus turn-level PPO training on Qwen2.5-1.5B for the search task (results averaged over 5 runs). (a) Success rates show that turn-level PPO outperforms token-level PPO. (b) L2 norms of policy gradients show that turn-level PPO exhibits greater training stability. (c–d) Additional diagnostic metrics for token-level PPO: (c) the L2 norm of the PPO loss function remains stable throughout training due to gradient clipping, while (d) the L2 norm of the clipping bias term grows exponentially over time, validating Lemma \ref{lm:ppo_grad_decomp}.
  • Figure 4: Experimental results for Qwen2.5-7B policy models, with the value model also trained from Qwen2.5-7B. Results are averaged over three trials. We report the average success rate on the NQ and HotpotQA datasets.
  • Figure 5: We also report (a) the clipping ratio and (b) the KL divergence during policy optimization. Both ST-PPO and S-PPO achieve lower clipping ratios and lower KL divergence than vanilla PPO, indicating more stable training dynamics. For GRPO, we do not report values beyond the 160th step because the algorithm collapses around that point, after which the metrics become NaN.
  • ...and 1 more figure
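The clipping ratio and KL divergence reported in Figure 5 are diagnostics that are commonly logged during PPO training. The captions do not give the paper's exact definitions, so the sketch below rests on assumptions: the clipping ratio is taken to be the fraction of importance ratios falling outside $[1-\epsilon, 1+\epsilon]$, and the KL divergence is approximated by the mean log-probability gap on samples drawn from the old policy. Both function names are hypothetical.

```python
def clip_fraction(ratios, eps=0.2):
    """Fraction of importance ratios outside [1 - eps, 1 + eps].

    Assumed definition for this sketch; some implementations count only
    the samples where clipping actually changes the loss.
    """
    outside = sum(1 for r in ratios if r < 1.0 - eps or r > 1.0 + eps)
    return outside / len(ratios)


def approx_kl(old_logps, new_logps):
    """Sample estimate of KL(old || new): mean of (log pi_old - log pi_new).

    Assumes the samples were drawn from the old policy.
    """
    gaps = [o - n for o, n in zip(old_logps, new_logps)]
    return sum(gaps) / len(gaps)


# Made-up values for illustration.
frac = clip_fraction([0.7, 1.0, 1.1, 1.35], eps=0.2)
kl = approx_kl([-1.0, -2.0], [-1.5, -2.5])
```

A rising clip fraction indicates that more of the batch is far off-policy, which is the regime in which the abstract says critic-based advantage estimates become unreliable.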

Theorems & Definitions (4)

  • Lemma 4.1
  • proof
  • Lemma 4.2
  • proof