STeCa: Step-level Trajectory Calibration for LLM Agent Learning

Hanlin Wang, Jian Wang, Chak Tou Leong, Wenjie Li

TL;DR

This work tackles the challenge of long-horizon decision-making in LLM-based agents by introducing Step-level Trajectory Calibration (STeCa). STeCa detects suboptimal step-level deviations using Monte Carlo-based rewards and constructs calibrated trajectories through LLM reflection, which are then used alongside successful trajectories for reinforced training. The framework combines a supervised warm-up, calibrated trajectory construction, and trajectory-level RL with a deviation-distance reward, and it outperforms existing methods on the VirtualHome and ALFWorld benchmarks. Key contributions include step-level reward acquisition, a practical deviation-detection criterion, calibrated trajectory generation via reflective thinking, and an integrated RL objective that leverages trajectory deviation distance for robust learning. Overall, STeCa improves robustness and success rates on long-horizon tasks and offers a way to train LLM agents to correct their own deviations before suboptimal actions accumulate.
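The Monte Carlo step reward mentioned above can be illustrated with a minimal sketch. It assumes that the reward of taking an action after a given action history is estimated as the mean final task reward over several rollouts that continue from that point with the current policy. The names `mc_step_reward`, `rollout_fn`, `num_rollouts`, and the toy environment `toy_rollout` are hypothetical illustrations, not the authors' code; the repository linked in the abstract holds the actual implementation.

```python
import random
from typing import Callable, Sequence

# Hypothetical interface: given an action history and an RNG, continue the
# episode with the current policy and return the final task reward in [0, 1].
RolloutFn = Callable[[Sequence[str], random.Random], float]


def mc_step_reward(prefix: Sequence[str], action: str, rollout_fn: RolloutFn,
                   num_rollouts: int = 8, seed: int = 0) -> float:
    """Estimate the step reward of `action` taken after `prefix` as the mean
    final reward of `num_rollouts` rollouts starting from prefix + [action]."""
    if num_rollouts <= 0:
        raise ValueError("num_rollouts must be positive")
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    history = list(prefix) + [action]
    total = sum(rollout_fn(history, rng) for _ in range(num_rollouts))
    return total / num_rollouts


def toy_rollout(history: Sequence[str], rng: random.Random) -> float:
    # Toy environment (an assumption for illustration only): the task
    # succeeds with probability 0.9 if the fridge was opened, otherwise 0.1.
    p = 0.9 if "open fridge" in history else 0.1
    return 1.0 if rng.random() < p else 0.0


good = mc_step_reward(["walk to kitchen"], "open fridge", toy_rollout, num_rollouts=200)
bad = mc_step_reward(["walk to kitchen"], "walk to bedroom", toy_rollout, num_rollouts=200)
print(f"open fridge: {good:.2f}, walk to bedroom: {bad:.2f}")
```

Averaging over rollouts turns a sparse end-of-episode reward into a per-step signal, at the cost of `num_rollouts` extra episodes per scored action.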

Abstract

Large language model (LLM)-based agents have shown promise in tackling complex tasks by interacting dynamically with the environment. Existing work primarily focuses on behavior cloning from expert demonstrations or preference learning through exploratory trajectory sampling. However, these methods often struggle with long-horizon tasks, where suboptimal actions accumulate step by step, causing agents to deviate from correct task trajectories. To address this, we highlight the importance of timely calibration and the need to automatically construct calibrated trajectories for training agents. We propose Step-level Trajectory Calibration (STeCa), a novel framework for LLM agent learning. Specifically, STeCa identifies suboptimal actions through a step-level reward comparison during exploration. It constructs calibrated trajectories using LLM-driven reflection, enabling agents to learn from improved decision-making processes. Finally, we combine these calibrated trajectories with successful trajectories for reinforced training. Extensive experiments demonstrate that STeCa significantly outperforms existing methods. Further analysis shows that timely calibration enables agents to complete tasks with greater robustness. Our code and data are available at https://github.com/WangHanLinHenry/STeCa.
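The step-level reward comparison can be sketched as follows, assuming that at each step the agent's explored action and the expert's action have already been scored (for example with Monte Carlo step rewards) and that a step counts as a deviation when the agent's reward falls short of the expert's by more than a margin. The function name `first_deviation_step` and the threshold `delta` are hypothetical; the exact criterion STeCa uses may differ.

```python
from typing import Optional, Sequence


def first_deviation_step(expert_rewards: Sequence[float],
                         agent_rewards: Sequence[float],
                         delta: float = 0.0) -> Optional[int]:
    """Return the index of the first step where the agent's action scores
    lower than the expert's action by more than `delta`, or None if the
    agent never deviates under this (assumed) criterion."""
    if len(expert_rewards) != len(agent_rewards):
        raise ValueError("reward sequences must have equal length")
    for t, (r_expert, r_agent) in enumerate(zip(expert_rewards, agent_rewards)):
        if r_expert - r_agent > delta:
            return t
    return None


# Per-step rewards for an expert trajectory and the agent's explored actions.
print(first_deviation_step([0.8, 0.9, 1.0], [0.8, 0.4, 0.9], delta=0.2))
```

Returning only the first flagged step reflects the idea of timely calibration: a trajectory would be corrected at the earliest point where it starts to go wrong, before later errors compound.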

Paper Structure

This paper contains 49 sections, 8 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Step-level calibration enables LLM agents to construct calibrated trajectories and learn to mitigate the accumulation of suboptimal actions.
  • Figure 2: Overview of the Step-level Trajectory Calibration (STeCa) framework for LLM agent learning.
  • Figure 3: Variations in Monte Carlo (MC) step rewards with respect to the number of remaining steps until task completion for expert trajectories.
  • Figure 4: Calibration performance of different methods on the VirtualHome and ALFWorld datasets.
  • Figure 5: Correlation between the deviation distance and success rate (measured by average final reward).