Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement

Weimin Xiong, Yifan Song, Xiutian Zhao, Wenhao Wu, Xun Wang, Ke Wang, Cheng Li, Wei Peng, Sujian Li

TL;DR

The paper introduces Iterative step-level Process Refinement (IPR), a framework that adds step-level supervision to LLM agent training by estimating step-level rewards with Monte Carlo sampling and constructing contrastive action pairs. IPR combines supervised fine-tuning, step-level reward acquisition, and iterative optimization with a mixed loss made up of outcome-DPO, step-DPO, and SFT terms, refining agent behavior along long trajectories. Experiments on WebShop, InterCodeSQL, and ALFWorld show consistent improvements over prompt-based, outcome refinement, and process refinement baselines, along with gains in action efficiency and robustness to unseen tasks. Further analyses examine the value of step-level supervision across base models, the impact of ablations, and the potential of a learned step reward model to accelerate training. Overall, IPR offers a practical way to improve LLM agents by using fine-grained process information during learning.
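
To make the mixed loss concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It uses the standard DPO objective on sequence log-probabilities. The function names, the default `beta`, the unweighted sum of the three terms, and the per-token normalization of the SFT term are all assumptions made for illustration.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, dispreferred) pair.

    logp_* are sequence log-probabilities under the policy being trained,
    ref_logp_* are the same quantities under a frozen reference policy.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)

def sft_loss(logp_expert: float, num_tokens: int) -> float:
    """Negative log-likelihood of the expert trajectory, averaged per token
    (the normalization is an assumption of this sketch)."""
    return -logp_expert / num_tokens

def mixed_loss(outcome_pair, step_pair, logp_expert, num_tokens, beta=0.1):
    """Unweighted sum of outcome-DPO, step-DPO, and SFT terms.

    outcome_pair: log-probs for a full-trajectory preference pair.
    step_pair:    log-probs for a pair that diverges at a single step.
    Each pair is (logp_w, logp_l, ref_logp_w, ref_logp_l).
    """
    return (dpo_loss(*outcome_pair, beta=beta)
            + dpo_loss(*step_pair, beta=beta)
            + sft_loss(logp_expert, num_tokens))

if __name__ == "__main__":
    # Hypothetical log-probabilities, chosen only to exercise the functions.
    outcome = (-12.0, -15.0, -13.0, -14.0)
    step = (-4.0, -6.0, -5.0, -5.0)
    print(mixed_loss(outcome, step, logp_expert=-12.0, num_tokens=24))
```

When the policy matches the reference on both members of a pair, the DPO term equals log 2, and it falls as the policy shifts probability toward the preferred trajectory.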

Abstract

Large language model agents have exhibited exceptional performance across a range of complex interactive tasks. Recent approaches have utilized tuning with expert trajectories to enhance agent performance, yet they primarily concentrate on outcome rewards, which may lead to errors or suboptimal actions due to the absence of process supervision signals. In this paper, we introduce the Iterative step-level Process Refinement (IPR) framework, which provides detailed step-by-step guidance to enhance agent training. Specifically, we adopt the Monte Carlo method to estimate step-level rewards. During each iteration, the agent explores along the expert trajectory and generates new actions. These actions are then evaluated against the corresponding step of the expert trajectory using step-level rewards. This comparison helps identify discrepancies, yielding contrastive action pairs that serve as training data for the agent. Our experiments on three complex agent tasks demonstrate that our framework outperforms a variety of strong baselines. Moreover, our analytical findings highlight the effectiveness of IPR in augmenting action efficiency and its applicability to diverse models.
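
The exploration and scoring loop described in the abstract can be sketched on a toy task. This is an illustration under stated assumptions, not the paper's code: the number-line environment, the `scorer_policy`, the sample count, and the reward-gap `threshold` are all hypothetical. A step reward is estimated as the average final outcome of several rollouts that start right after the action in question, and a contrastive pair is kept only when the expert action scores clearly higher than the agent's action at the same step.

```python
import random
from typing import Callable, List, Tuple

TARGET, MAX_STEPS = 3, 6  # toy task: reach position 3 within 6 moves

Policy = Callable[[int, random.Random], int]

def step(state: int, action: int) -> int:
    """Toy transition: actions are +1 or -1 on a number line."""
    return state + action

def outcome_reward(state: int) -> float:
    """Outcome reward, available only at the end of an episode."""
    return 1.0 if state == TARGET else 0.0

def rollout(state: int, steps_left: int, policy: Policy,
            rng: random.Random) -> float:
    """Play the episode to its end with `policy`; return the outcome reward."""
    while steps_left > 0 and state != TARGET:
        state = step(state, policy(state, rng))
        steps_left -= 1
    return outcome_reward(state)

def mc_step_reward(state: int, action: int, steps_left: int, policy: Policy,
                   rng: random.Random, n_samples: int = 200) -> float:
    """Monte Carlo estimate of a step reward: take `action`, then average
    the outcome rewards of `n_samples` rollouts from the resulting state."""
    next_state = step(state, action)
    total = sum(rollout(next_state, steps_left - 1, policy, rng)
                for _ in range(n_samples))
    return total / n_samples

def scorer_policy(state: int, rng: random.Random) -> int:
    """Hypothetical rollout policy: moves toward the target 70% of the time."""
    return 1 if rng.random() < 0.7 else -1

def build_pairs(expert_actions: List[int], agent_policy: Policy,
                rng: random.Random, threshold: float = 0.1
                ) -> List[Tuple[int, int, int, int]]:
    """Walk along the expert trajectory; at each step let the agent propose
    an action and keep (t, state, expert_action, agent_action) when the
    expert's estimated step reward exceeds the agent's by `threshold`."""
    pairs = []
    state = 0
    for t, expert_action in enumerate(expert_actions):
        steps_left = MAX_STEPS - t
        agent_action = agent_policy(state, rng)
        if agent_action != expert_action:
            r_expert = mc_step_reward(state, expert_action, steps_left,
                                      scorer_policy, rng)
            r_agent = mc_step_reward(state, agent_action, steps_left,
                                     scorer_policy, rng)
            if r_expert - r_agent > threshold:
                pairs.append((t, state, expert_action, agent_action))
        state = step(state, expert_action)  # stay on the expert trajectory
    return pairs

if __name__ == "__main__":
    rng = random.Random(0)
    always_back = lambda state, rng: -1  # a deliberately poor agent
    for pair in build_pairs([1, 1, 1], always_back, rng):
        print(pair)
```

In this sketch each kept tuple marks a step where the agent's action lowers the estimated chance of success relative to the expert's action; such pairs would then feed a step-level preference loss.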

Paper Structure

This paper contains 34 sections, 11 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Comparison of three different agent training paradigms. Green and red circles represent correct and incorrect actions, while check and cross marks indicate the final outcome. Compared to the other methods, IPR can provide step-level process supervision.
  • Figure 2: The overall architecture of IPR in a single iteration. The SFT-trained agent first explores new actions along the expert trajectory. Then we use the scorer to reward each step and construct contrastive action data. Finally, we optimize the agent with a mixed loss.
  • Figure 3: Step reward estimation quality on WebShop.
  • Figure 4: The average reward per step.
  • Figure 5: Case study for WebShop.
  • ...and 4 more figures