Training LLM-Based Agents with Synthetic Self-Reflected Trajectories and Partial Masking

Yihan Chen, Benfeng Xu, Xiaorui Wang, Yongdong Zhang, Zhendong Mao

TL;DR

This work targets two bottlenecks in training open-source LLM-based agents, performance plateaus and error propagation, by introducing STeP, a training framework that combines self-reflected trajectories with a partial masking strategy that keeps the model from learning incorrect steps. A teacher model evaluates and corrects the agent's actions in real time during interaction, so the resulting trajectories teach the agent to reflect and self-correct, mitigating error cascades. Validated on ALFWorld, WebShop, and SciWorld with LLaMA2-7B-Chat as the agent and an open-source teacher (e.g., Qwen1.5-110B-Chat), the approach improves over baselines trained only on golden trajectories while using less training data. These findings suggest that self-reflection-informed data augmentation can improve both the effectiveness and the data efficiency of open-source LLM agents across multiple tasks.

Abstract

Autonomous agents, which perceive environments and take actions to achieve goals, have become increasingly feasible with advances in large language models (LLMs). However, current powerful agents often depend on sophisticated prompt engineering combined with closed-source LLMs like GPT-4. Although training open-source LLMs on expert trajectories from teacher models has yielded some improvements in agent capabilities, this approach still faces limitations such as performance plateauing and error propagation. To mitigate these challenges, we propose STeP, a novel method for improving LLM-based agent training. We synthesize self-reflected trajectories that include reflections on and corrections of erroneous steps, which help LLM agents learn more effectively from teacher models and enable them to self-reflect and self-correct. We also introduce a partial masking strategy that prevents the LLM from internalizing incorrect or suboptimal steps. Experiments demonstrate that our method improves agent performance across three representative tasks: ALFWorld, WebShop, and SciWorld. The open-source model LLaMA2-7B-Chat, when trained on self-reflected trajectories constructed with Qwen1.5-110B-Chat as the teacher model, achieves comprehensive improvements with less training data than agents trained exclusively on expert trajectories.
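
The abstract does not spell out how partial masking is implemented. One plausible reading is that tokens belonging to erroneous agent steps are excluded from the supervised fine-tuning loss, while the subsequent reflection and corrected action remain trainable. The sketch below illustrates that reading only; the `Step` class, its `is_error` flag, and `build_labels` are hypothetical names introduced here, not taken from the paper. It relies on the common convention that label positions set to -100 are skipped by the cross-entropy loss (the default `ignore_index` in PyTorch).

```python
from dataclasses import dataclass
from typing import List

# Label value that cross-entropy losses conventionally skip
# (PyTorch's default ignore_index).
IGNORE_INDEX = -100


@dataclass
class Step:
    """One turn of a trajectory (hypothetical structure for illustration)."""
    role: str                # "env" for observations, "agent" for model outputs
    token_ids: List[int]     # tokenized text of this turn
    is_error: bool = False   # assumed flag: the teacher judged this step wrong


def build_labels(trajectory: List[Step]) -> List[int]:
    """Return per-token labels, masking everything the agent should not imitate.

    Environment turns and erroneous agent steps are masked; correct agent
    steps (including reflections and corrections) keep their token ids.
    """
    labels: List[int] = []
    for step in trajectory:
        trainable = step.role == "agent" and not step.is_error
        if trainable:
            labels.extend(step.token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(step.token_ids))
    return labels


# Toy trajectory: the first agent step is wrong, the second one is the
# reflection plus corrected action.
trajectory = [
    Step("env", [1, 2, 3]),
    Step("agent", [4, 5], is_error=True),
    Step("env", [6]),
    Step("agent", [7, 8, 9]),
]
labels = build_labels(trajectory)
```

Under this assumption the erroneous step still appears in the model's input context, so the agent sees the mistake it must recover from, but no gradient encourages it to reproduce that mistake.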

Paper Structure

This paper contains 35 sections, 6 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: A self-reflected agent could autonomously identify, reflect on and correct errors based on interaction history.
  • Figure 2: Self-Reflected Trajectories on WebShop.
  • Figure 3: STeP utilizes golden trajectories and corresponding instructions to train a Self-reflected LLM-based agent through three stages. Stage 1: Agent Initialization; Stage 2: Self-Reflected Trajectories Synthesizing; Stage 3: SFT with Partial Masking.
  • Figure 4: Compared to golden only, self-reflected trajectories help LLMs learn more effectively and efficiently.
  • Figure 5: The number of Self-Reflected Trajectories generated by different teacher models, along with the average reward of the LLM-based agent trained on them.
  • ...and 4 more figures