Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training

Siyu Yuan; Zehui Chen; Zhiheng Xi; Junjie Ye; Zhengyin Du; Jiecao Chen

TL;DR

Agent-R introduces an iterative self-training framework that endows language-model agents with on-the-fly reflection through model-guided revision trajectories constructed via Monte Carlo Tree Search. By identifying the first error and splicing in corrected segments, Agent-R enables timely error revision and self-improvement without sole reliance on expert trajectories. Phase I generates diverse revision trajectories; Phase II performs iterative self-training that blends revision, good, and general data, enabling scalable, multi-task learning. Across WebShop, ScienceWorld, and TextCraft, Agent-R achieves superior performance and reduced looping compared with baselines, demonstrating the value of self-reflection for robust, long-horizon, agentic tasks.

Abstract

Large language model (LLM) agents are increasingly pivotal for addressing complex tasks in interactive environments. Existing work mainly focuses on enhancing performance through behavior cloning from stronger experts, yet such approaches often falter in real-world applications, mainly because the resulting agents cannot recover from errors. Step-level critique data would address this, but it is difficult and expensive to collect. Automating and dynamically constructing self-critique datasets is thus crucial to empowering models with intelligent agent capabilities. In this work, we propose an iterative self-training framework, Agent-R, that enables a language Agent to Reflect on the fly. Unlike traditional methods that reward or penalize actions based on correctness, Agent-R leverages Monte Carlo Tree Search (MCTS) to construct training data that recover correct trajectories from erroneous ones. A key challenge of agent reflection lies in the necessity for timely revision rather than waiting until the end of a rollout. To address this, we introduce a model-guided critique construction mechanism: the actor model identifies the first error step (within its current capability) in a failed trajectory. Starting from that step, we splice the failed trajectory with the adjacent correct path, which shares the same parent node in the tree. This strategy enables the model to learn reflection based on its current policy, thereby yielding better learning efficiency. To further explore the scalability of this self-improvement paradigm, we investigate iterative refinement of both error correction capabilities and dataset construction. Our findings demonstrate that Agent-R continuously improves the model's ability to recover from errors and enables timely error correction. Experiments on three interactive environments show that Agent-R effectively equips agents to correct erroneous actions while avoiding loops, achieving superior performance compared to baseline methods (+5.59%).
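The splicing step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Step` representation, the `is_error` callback (standing in for the actor model's judgment), the wording of `REVISION_SIGNAL`, and the use of a shared prefix to model "same parent node" are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical step representation: (action, observation).
Step = Tuple[str, str]

# Hypothetical revision-signal text; the actual wording is not given in this summary.
REVISION_SIGNAL: Step = ("revision_signal", "I made a mistake earlier; let me revise.")


def common_prefix_len(bad: List[Step], good: List[Step]) -> int:
    """Number of leading steps the two trajectories share (their common ancestors)."""
    n = 0
    for b, g in zip(bad, good):
        if b != g:
            break
        n += 1
    return n


def first_error_step(
    bad: List[Step],
    is_error: Callable[[List[Step]], bool],
    start: int = 0,
) -> Optional[int]:
    """Index of the first step flagged as erroneous, scanning from `start`.

    `is_error` receives the trajectory prefix ending at the step under review
    and stands in for the actor model's own judgment (an assumption here).
    """
    for t in range(start, len(bad)):
        if is_error(bad[: t + 1]):
            return t
    return None


def build_revision_trajectory(
    bad: List[Step],
    good: List[Step],
    is_error: Callable[[List[Step]], bool],
) -> Optional[List[Step]]:
    """Splice a failed trajectory with a sibling correct path.

    Keeps the failed steps up to and including the first detected error,
    appends the revision signal, then continues along the correct path
    from the point where the two trajectories diverge.
    """
    branch = common_prefix_len(bad, good)
    t_prime = first_error_step(bad, is_error, start=branch)
    if t_prime is None:
        return None  # no error detected within current capability; skip this pair
    return bad[: t_prime + 1] + [REVISION_SIGNAL] + good[branch:]
```

Cutting at the first detected error rather than at the end of the rollout is what makes the resulting training example teach timely revision: the model sees a short erroneous segment followed immediately by the correction.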

Paper Structure (37 sections, 5 equations, 6 figures, 8 tables)

Introduction
Preliminary
Task Formulation
Monte Carlo Tree Search
Method
Phase I: Model-Guided Reflection Trajectory Generation
Reflection Trajectory Definition
Trajectory Collection with MCTS
Transition Point Determination with Actor Model
Phase II: Iterative Self-Training with Revision Trajectories
Experiment
Interactive and Agentic Environments
Experiment Setting
Data Split
MCTS Settings
...and 22 more sections

Figures (6)

Figure 1: Illustration of language agents struggling with error correction in trajectory generation. These errors can cause agents to enter loops, hindering recovery in long trajectories and resulting in suboptimal outcomes. Agent-R enables agents to detect and address errors in real time, handling long-horizon tasks and avoiding loops through stronger self-reflection capabilities.
Figure 2: The framework of Agent-R consists of two phases. In Phase I, we adopt MCTS and a model-guided reflection mechanism to construct revision trajectories. In Phase II, the agents are trained using the collected revision trajectories. These two phases can be repeated iteratively. $\texttt{rs}$ is the revision signal, $t'$ is the transition point between the bad and good trajectories, and $L(\theta)$ is the loss function to be optimized.
Figure 3: Results of different training trajectories under different iterations on three interactive environments.
Figure 4: Average count of repeated action lengths for different training trajectories and different iterations in three interactive environments.
Figure 5: Average revision length of different iterations on three interactive environments.
...and 1 more figure
