Re-ReST: Reflection-Reinforced Self-Training for Language Agents

Zi-Yi Dou; Cheng-Fu Yang; Xueqing Wu; Kai-Wei Chang; Nanyun Peng

TL;DR

Re-ReST augments self-training for language agents with a reflector that uses environmental feedback to refine low-quality samples before they are added to the training data. The approach lets open-source agents improve autonomously across multi-hop reasoning, sequential decision-making, code generation, visual question answering, and text-to-image generation, without relying on proprietary models during training. Self-training improves baselines by 7.6% on HotpotQA and 28.4% on AlfWorld, and Re-ReST adds a further 2.0% and 14.1%, respectively. Test-time reflection with self-consistency and direct preference optimization extend the method's applicability. Because reflection is applied mainly during training, inference stays lightweight.

Abstract

Finetuning language agents with reasoning-action trajectories is effective, but obtaining these trajectories from human annotations or stronger models is costly and sometimes impractical. In this paper, we investigate the use of self-training in language agents, which can generate supervision from the agent itself, offering a promising alternative without relying on human or stronger model demonstrations. Self-training, however, requires high-quality model-generated samples, which are hard to obtain for challenging language agent tasks. To address this, we present Reflection-Reinforced Self-Training (Re-ReST), which uses a reflector to refine low-quality generated samples during self-training. The reflector takes the agent's output and feedback from an external environment (e.g., unit test results in code generation) to produce improved samples. This technique enhances the quality of inferior samples and efficiently enriches the self-training dataset with higher-quality samples. We conduct extensive experiments on open-source language agents across tasks, including multi-hop question answering, sequential decision-making, code generation, visual question answering, and text-to-image generation. The results demonstrate the effectiveness of self-training and Re-ReST in language agent tasks, with self-training improving baselines by 7.6% on HotpotQA and 28.4% on AlfWorld, and Re-ReST further boosting performance by 2.0% and 14.1%, respectively. Our studies also confirm the efficiency of using a reflector to generate high-quality samples for self-training. Moreover, we demonstrate a method to employ reflection during inference without ground-truth feedback, addressing the limitation of previous reflection work. Our code is released at https://github.com/PlusLabNLP/Re-ReST.

Paper Structure (45 sections, 2 equations, 3 figures, 15 tables)

Introduction
Method: Re-ReST
Self-Training.
Overview of Re-ReST.
Components
Language Agent.
Reflector.
Data Generation
Initial Generation.
Reflection with Environmental Feedback.
Model Training and Inference
Reflector Training.
Language Agent Training.
Inference.
Experiments
...and 30 more sections

Figures (3)

Figure 1: Previous agent training methods (Chen et al., 2023; Yin et al., 2024) distill knowledge from stronger models (e.g., GPT-4) to weaker ones (e.g., Llama-2). In contrast, we adopt self-training and enhance it with reflection to improve agents more autonomously, which reduces reliance on external proprietary models and maintains a fully open-source framework.
Figure 2: An overview of our Re-ReST method. Our approach incorporates self-training in language agent tasks by sampling multiple outputs from an agent and using positive samples for training. To enhance the effectiveness of self-training in language agents, we introduce a reflector mechanism. If a sample is incorrect, the reflector adjusts the agent's output based on environmental feedback. The corrected sample is then incorporated into the training data, thereby improving the overall self-training process.
Figure 3: In self-training, increasing the number of generations per instance initially improves model performance, but this effect plateaus. Additionally, both model performance and the number of solved training instances are lower than with Re-ReST, indicating our reflector can efficiently and effectively generate high-quality self-training data.
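
The data-generation loop described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `collect_self_training_data`, the callable signatures, and the toy agent, reflector, and environment are all hypothetical stand-ins for finetuned language models and a real task environment. The choice to re-check revised samples against the environment and to store reflector training pairs is also an assumption made for the sketch.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces, chosen only for this sketch.
Agent = Callable[[str], str]                          # task -> output
Reflector = Callable[[str, str, str], str]            # (task, failed output, feedback) -> revision
Environment = Callable[[str, str], Tuple[bool, str]]  # (task, output) -> (is_correct, feedback)


def collect_self_training_data(
    tasks: List[str],
    agent: Agent,
    reflector: Reflector,
    env: Environment,
    num_samples: int = 1,
):
    """Sample from the agent, keep correct outputs, and let the reflector
    revise incorrect ones using environment feedback."""
    agent_data = []      # (task, correct output) pairs for agent training
    reflector_data = []  # ((task, failed output, feedback), revision) pairs
    for task in tasks:
        for _ in range(num_samples):
            output = agent(task)
            ok, feedback = env(task, output)
            if ok:
                agent_data.append((task, output))
                continue
            # Incorrect sample: ask the reflector for a revision.
            revised = reflector(task, output, feedback)
            revised_ok, _ = env(task, revised)
            if revised_ok:  # assumption: only verified revisions are kept
                agent_data.append((task, revised))
                reflector_data.append(((task, output, feedback), revised))
    return agent_data, reflector_data


# Toy stand-ins (hypothetical): arithmetic tasks with known answers.
ANSWERS = {"2+3": "5", "4*4": "16", "9-1": "8", "7*3": "21"}


def toy_agent(task: str) -> str:
    # Deliberately wrong on multiplication so that reflection is triggered.
    return "8" if "*" in task else ANSWERS[task]


def toy_env(task: str, output: str) -> Tuple[bool, str]:
    ok = output == ANSWERS[task]
    return ok, ("correct" if ok else f"answer {output} is wrong")


def toy_reflector(task: str, output: str, feedback: str) -> str:
    # Repairs one task and fails on the other, to show that
    # unsuccessful revisions are discarded.
    return "16" if task == "4*4" else output


agent_data, reflector_data = collect_self_training_data(
    list(ANSWERS), toy_agent, toy_reflector, toy_env
)
print(agent_data)
print(reflector_data)
```

In this toy run the reflector recovers one training instance that plain sampling missed, which mirrors the Figure 3 observation that reflection solves more training instances than drawing additional samples alone.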
