STeCa: Step-level Trajectory Calibration for LLM Agent Learning
Hanlin Wang, Jian Wang, Chak Tou Leong, Wenjie Li
TL;DR
This work addresses long-horizon decision-making in LLM-based agents by introducing Step-level Trajectory Calibration (STeCa). STeCa detects suboptimal step-level deviations using Monte Carlo-based step rewards and constructs calibrated trajectories through LLM reflection; these are then used alongside successful trajectories for reinforced training. The framework combines a supervised warm-up, calibrated trajectory construction, and trajectory-level RL with a deviation-distance reward, and it outperforms existing methods on the VirtualHome and ALFWorld benchmarks. Key contributions include step-level reward acquisition, a practical deviation-detection criterion, calibrated trajectory generation via reflective thinking, and an integrated RL objective that uses trajectory deviation distance. Overall, STeCa improves robustness and success rates on long-horizon tasks and offers an approach to timely self-correction in LLM agents.
Abstract
Large language model (LLM)-based agents have shown promise in tackling complex tasks by interacting dynamically with the environment. Existing work primarily focuses on behavior cloning from expert demonstrations or preference learning through exploratory trajectory sampling. However, these methods often struggle with long-horizon tasks, where suboptimal actions accumulate step by step and cause agents to deviate from correct task trajectories. To address this, we highlight the importance of timely calibration and the need to automatically construct calibration trajectories for training agents. We propose Step-level Trajectory Calibration (STeCa), a novel framework for LLM agent learning. Specifically, STeCa identifies suboptimal actions through a step-level reward comparison during exploration. It then constructs calibrated trajectories using LLM-driven reflection, enabling agents to learn from improved decision-making processes. Finally, we use these calibrated trajectories together with successful trajectories for reinforced training. Extensive experiments demonstrate that STeCa significantly outperforms existing methods. Further analysis shows that timely calibration enables agents to complete tasks with greater robustness. Our code and data are available at https://github.com/WangHanLinHenry/STeCa.
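To make the step-level reward comparison concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a step reward is estimated by averaging the final outcome of several rollouts continued from that step, and that a deviation is flagged at the first step where the explored action's reward falls below a reference reward by more than a margin. The names mc_step_reward and first_deviation, the rollout callable, and the margin parameter are all hypothetical and introduced only for illustration.

```python
from typing import Callable, Optional, Sequence


def mc_step_reward(rollout: Callable[[], float], num_rollouts: int = 8) -> float:
    """Estimate a step reward as the mean final outcome over several rollouts.

    `rollout` is a hypothetical callable that continues the episode from the
    state reached after the step and returns the final outcome (e.g. 1.0 for
    task success, 0.0 for failure).
    """
    if num_rollouts <= 0:
        raise ValueError("num_rollouts must be positive")
    return sum(rollout() for _ in range(num_rollouts)) / num_rollouts


def first_deviation(
    explored_rewards: Sequence[float],
    reference_rewards: Sequence[float],
    margin: float = 0.0,
) -> Optional[int]:
    """Return the index of the first step whose explored reward is lower than
    the reference reward by more than `margin`, or None if no step qualifies.

    The comparison rule and the margin are assumptions of this sketch.
    """
    for t, (r_explored, r_reference) in enumerate(
        zip(explored_rewards, reference_rewards)
    ):
        if r_explored < r_reference - margin:
            return t
    return None


if __name__ == "__main__":
    # Toy step rewards; in practice each value would come from mc_step_reward.
    explored = [0.9, 0.8, 0.3, 0.2]
    reference = [0.9, 0.85, 0.8, 0.8]
    print(first_deviation(explored, reference, margin=0.1))
```

In this sketch, the step returned by first_deviation marks where a calibrated trajectory would begin, for example by prompting an LLM to reflect on the deviated action and continue from a corrected one.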
