ST-PPO: Stabilized Off-Policy Proximal Policy Optimization for Multi-Turn Agents Training

Chenliang Li; Adel Elmahdy; Alex Boyd; Zhongruo Wang; Alfredo Garcia; Parminder Bhatia; Taha Kass-Hout; Cao Xiao; Mingyi Hong

TL;DR

The paper addresses the instability of PPO in multi-turn LLM agent training, which it attributes to two causes: a granularity mismatch between token-level optimization and turn-level interactions, and high-variance gradients from critic estimates on off-policy samples. It introduces Turn-PPO, which uses turn-level importance sampling for credit assignment, and ST-PPO, which combines turn-level importance sampling with clipping-bias correction to stabilize updates. Experiments on multi-turn QA benchmarks (e.g., NQ, HotpotQA) and medical datasets show that ST-PPO and S-PPO prevent training collapse, maintain lower clipping ratios, and outperform token-level PPO and other baselines. The results support aligning optimization with the task structure and mitigating off-policy instability as a practical and scalable way to stabilize reinforcement learning for multi-turn LLM agents.

Abstract

PPO has been widely adopted for training large language models (LLMs) at the token level in multi-turn dialogue and reasoning tasks. However, its performance is often unstable and prone to collapse. Through empirical analysis, we identify two main sources of instability in this setting: (1) token-level importance sampling, which is misaligned with the natural granularity of multi-turn environments that have distinct turn-level stages, and (2) inaccurate advantage estimates from off-policy samples, where the critic has not learned to evaluate certain state-action pairs, resulting in high-variance gradients and unstable updates. To address these challenges, we introduce two complementary stabilization techniques: (1) turn-level importance sampling, which aligns optimization with the natural structure of multi-turn reasoning, and (2) clipping-bias correction, which normalizes gradients by downweighting unreliable, highly off-policy samples. Depending on how these components are combined, we obtain three variants: Turn-PPO (turn-level sampling only), S-PPO (clipping-bias correction applied to token-level PPO), and ST-PPO (turn-level sampling combined with clipping-bias correction). In our experiments, we primarily study ST-PPO and S-PPO, which together demonstrate how the two stabilization mechanisms address complementary sources of instability. Experiments on multi-turn search tasks across general QA, multi-hop QA, and medical multiple-choice QA benchmarks show that ST-PPO and S-PPO consistently prevent the performance collapses observed in large-model training, maintain lower clipping ratios throughout optimization, and achieve higher task performance than standard token-level PPO. These results demonstrate that combining turn-level importance sampling with clipping-bias correction provides a practical and scalable solution for stabilizing multi-turn LLM agent training.
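To make the granularity contrast concrete, the following is a minimal sketch of token-level versus turn-level importance ratios, together with the fraction of ratios that a PPO clip range would truncate. The abstract does not state the exact turn-level formula, so the aggregation used here (exponential of the mean per-token log-ratio within a turn) is an assumption chosen for illustration, as are the function names, the `turn_ids` layout, and the clip width `eps=0.2`. Clipping-bias correction is not sketched because its form is not given here.

```python
import math


def token_ratios(logp_new, logp_old):
    """Standard per-token PPO importance ratios pi_new / pi_old."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]


def turn_ratios(logp_new, logp_old, turn_ids):
    """One importance ratio per turn.

    ASSUMPTION: a turn's ratio is exp(mean token log-ratio in that turn).
    This length-normalized aggregation is illustrative only; the paper's
    own definition may differ.
    """
    sums, counts = {}, {}
    for n, o, t in zip(logp_new, logp_old, turn_ids):
        sums[t] = sums.get(t, 0.0) + (n - o)
        counts[t] = counts.get(t, 0) + 1
    return {t: math.exp(sums[t] / counts[t]) for t in sums}


def clipped_fraction(ratios, eps=0.2):
    """Fraction of ratios falling outside the clip range [1 - eps, 1 + eps]."""
    ratios = list(ratios)
    return sum(1 for r in ratios if abs(r - 1.0) > eps) / len(ratios)


if __name__ == "__main__":
    # Hypothetical log-probabilities for four tokens spread over two turns.
    logp_new = [-1.0, -1.5, -0.7, -2.0]
    logp_old = [-1.3, -1.2, -0.7, -2.1]
    turn_ids = [0, 0, 1, 1]

    per_token = token_ratios(logp_new, logp_old)
    per_turn = turn_ratios(logp_new, logp_old, turn_ids)
    print("token-level clipped fraction:", clipped_fraction(per_token))
    print("turn-level clipped fraction:", clipped_fraction(per_turn.values()))
```

In this toy input, the two tokens of the first turn drift in opposite directions, so their individual ratios leave the clip range while the aggregated turn ratio stays close to 1. This is only meant to illustrate why the granularity of the ratio affects how often clipping triggers; it says nothing about the magnitudes reported in the experiments.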
