Proximity-Based Multi-Turn Optimization: Practical Credit Assignment for LLM Agent Training

Yangyi Fang, Jiaye Lin, Xiaoliang Fu, Cong Qin, Haolin Shi, Chang Liu, Peilin Zhao

TL;DR

ProxMO integrates global context via two lightweight mechanisms: success-rate-aware modulation dynamically adapts gradient intensity based on episode-level difficulty, while proximity-based soft aggregation derives baselines through continuous semantic weighting at the step level.

Abstract

Multi-turn LLM agents are becoming pivotal to production systems, spanning customer service automation, e-commerce assistance, and interactive task management, where accurately distinguishing high-value informative signals from stochastic noise is critical for sample-efficient training. In real-world scenarios, a failure in a trivial task may reflect random instability, whereas success in a high-difficulty task signifies a genuine capability breakthrough. Yet, existing group-based policy optimization methods rigidly rely on statistical deviation within discrete batches, frequently misallocating credit when task difficulty fluctuates. To address this issue, we propose Proximity-based Multi-turn Optimization (ProxMO), a practical and robust framework engineered specifically for the constraints of real-world deployment. ProxMO integrates global context via two lightweight mechanisms: success-rate-aware modulation dynamically adapts gradient intensity based on episode-level difficulty, while proximity-based soft aggregation derives baselines through continuous semantic weighting at the step level. Extensive evaluations on ALFWorld and WebShop benchmarks demonstrate that ProxMO yields substantial performance gains over existing baselines with negligible computational cost. Ablation studies further validate the independent and synergistic efficacy of both mechanisms. Crucially, ProxMO offers plug-and-play compatibility with standard GRPO frameworks, facilitating immediate, low-friction adoption in existing industrial training pipelines. Our implementation is available at: https://anonymous.4open.science/r/proxmo-B7E7/README.md
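To make the credit misallocation concrete, the following minimal sketch shows how a plain group-relative z-score (the GRPO-style normalization the abstract refers to) treats an easy group and a hard group. The function name `group_zscores`, the binary rewards, the group size of four, and the use of a population standard deviation are all illustrative assumptions; this is not the ProxMO implementation.

```python
import math

def group_zscores(rewards, eps=1e-8):
    """Group-relative advantages: (r - mean) / std within one rollout group.

    Assumption: population std and a small eps for numerical safety.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical binary episode rewards for two tasks of different difficulty.
easy_group = [1.0, 1.0, 1.0, 0.0]  # 75% success: the single failure is the outlier
hard_group = [0.0, 0.0, 0.0, 1.0]  # 25% success: the single success is the outlier

easy_adv = group_zscores(easy_group)
hard_adv = group_zscores(hard_group)

# The two groups produce the same set of advantage magnitudes, only mirrored.
easy_mags = sorted(abs(a) for a in easy_adv)
hard_mags = sorted(abs(a) for a in hard_adv)
print(easy_mags)
print(hard_mags)
```

Under these assumptions, the lone failure on the easy task receives an advantage of roughly -1.73 and the lone success on the hard task roughly +1.73: the same magnitude, even though the abstract argues the former may be noise and the latter a genuine capability signal. Success-rate-aware modulation is described as rescaling gradient intensity using this episode-level difficulty; its exact functional form is not given in this excerpt.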

Paper Structure (47 sections, 25 equations, 32 figures, 4 tables)

Figures (32)

  • Figure 1: Motivating challenges in multi-turn policy optimization. (a) Context-blind normalization: identical z-score magnitudes yield uniform advantage intensities across high-success (e.g., 75%) and low-success (e.g., 25%) groups, ignoring informational heterogeneity. (b) Hard boundary partitioning: binary participation (in/out based on threshold) with equal intra-group weighting, causing singleton degeneracy under strict criteria or incorrect equal weighting under loose criteria.
  • Figure 2: The overview of ProxMO. Episode-level: success-rate-aware modulation adapts credit to task difficulty (i.e., $p$). Step-level: proximity-based soft aggregation eliminates discrete boundaries for robust baseline estimation.
  • Figure 3: Hyperparameter sensitivity analysis on ALFWorld and WebShop, with ProxMO maintaining stable high performance across broad parameter configurations. For clarity, temperature $\tau$ is visualized on a logarithmic scale.
  • Figure 4: Ablation study on ALFWorld, where ProxMO outperforms all variants and the strong baseline GiGPO.
  • Figure 5: Training time comparison on ALFWorld (shaded regions denote confidence intervals) reveals ProxMO incurs a minimal additional overhead (+1.09%) versus GRPO across training iterations, confirming its computational efficiency for scalable pipelines.
  • ...and 27 more figures
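
The contrast drawn in Figures 1(b) and 2 between hard boundary partitioning and proximity-based soft aggregation can be sketched as follows. This is only an illustration under stated assumptions: the similarity scores are made up, and the softmax-over-similarity weighting with temperature `tau` is one plausible reading of "continuous semantic weighting", not the paper's confirmed formula. The helper names `hard_baseline` and `soft_baseline` are hypothetical.

```python
import math

def hard_baseline(returns, sims, threshold):
    """Mean return over steps whose similarity to the anchor clears a threshold.

    Every member counts equally; everything else is excluded entirely.
    Assumes the anchor itself (similarity 1.0) is always a member.
    """
    members = [r for r, s in zip(returns, sims) if s >= threshold]
    return sum(members) / len(members), len(members)

def soft_baseline(returns, sims, tau):
    """Similarity-weighted mean return with softmax(sim / tau) weights (assumed form)."""
    top = max(sims)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / tau) for s in sims]
    total = sum(weights)
    return sum(w * r for w, r in zip(weights, returns)) / total

# Hypothetical step returns and similarities to an anchor step (index 0 is the anchor).
returns = [1.0, 0.0, 1.0, 0.0]
sims = [1.0, 0.9, 0.6, 0.2]

strict, strict_size = hard_baseline(returns, sims, threshold=0.95)
loose, loose_size = hard_baseline(returns, sims, threshold=0.10)
soft = soft_baseline(returns, sims, tau=0.1)

print(strict, strict_size)  # strict threshold: only the anchor itself qualifies
print(loose, loose_size)    # loose threshold: all steps weighted equally
print(soft)                 # soft weights: near steps dominate, far steps fade out
```

With the strict threshold the group collapses to the anchor alone, so the baseline equals the anchor's own return and its advantage is zero (the singleton degeneracy in Figure 1(b)). With the loose threshold a barely related step counts as much as a near-identical one. The soft variant sits between the two and has no discrete boundary; in this sketch a smaller `tau` concentrates the weight on the closest steps.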