Improving Retrospective Language Agents via Joint Policy Gradient Optimization

Xueyang Feng, Bo Lan, Quanyu Dai, Lei Wang, Jiakai Tang, Xu Chen, Zhenhua Dong, Ji-Rong Wen

TL;DR

RetroAct addresses the challenge of enabling autonomous language agents to both plan and reflect without depending on closed-source LLMs. It introduces an off-policy, joint imitation-learning and reinforcement-learning framework that trains a planner and a reflector to iteratively improve task-solving behavior, with imitation regularization to stabilize learning. Across HotpotQA, ALFWorld, and InterCode using open-source backbones, RetroAct yields substantial performance gains and demonstrates mutual benefits between planning and reflection, approaching or surpassing strong baselines in some settings. The work advances practical, continuously improving retrospective agents and broadens the feasible deployment of open-source LLMs for complex reasoning and interaction tasks.

Abstract

Large language models (LLMs) have sparked great interest in building autonomous agents. However, current prompt-based agents often rely heavily on large-scale LLMs. Meanwhile, although fine-tuning significantly enhances the capabilities of smaller LLMs, fine-tuned agents often lack the capacity for self-reflection and self-improvement. To address these challenges, we introduce RetroAct, a novel agent framework that jointly optimizes both the task-planning and self-reflective evolution capabilities of language agents. Specifically, we develop a two-stage joint optimization process that integrates imitation learning and reinforcement learning, and design an off-policy joint policy gradient optimization algorithm with imitation learning regularization to enhance data efficiency and training stability in agent tasks. RetroAct significantly improves the performance of open-source models, reduces dependency on closed-source LLMs, and enables fine-tuned agents to learn and evolve continuously. We conduct extensive experiments across various testing environments, demonstrating that RetroAct achieves substantial improvements in task performance and decision-making processes.

Paper Structure

This paper contains 28 sections, 13 equations, 9 figures, 6 tables, 1 algorithm.

Figures (9)

  • Figure 1: Overview of the retrospective language agent. The planner analyzes task requirements, calls external tools, and gathers feedback. If planning fails, the reflector intervenes to adjust the strategy until the issue is resolved. Through joint strategy optimization, RetroAct continually enhances both the planner and the reflector to tackle complex tasks more effectively.
  • Figure 2: Schematic of Joint Policy Gradient Optimization for the Retrospective Language Agent. Our approach is divided into two stages: (a) Imitation Learning: We use expert models to generate expert trajectories, employ evaluators to filter these trajectories, and then use the retained ones to fine-tune the student models. (b) Reinforcement Learning: The planner and reflector are jointly optimized through an off-policy reinforcement learning algorithm with an imitation learning regularizer.
  • Figure 3: Multi-Agent vs Single Agent (Q2)
  • Figure 4: Effectiveness of Optimized Planner and Reflector (Q3)
  • Figure 5: Effectiveness of Reinforcement Learning (Q4)
  • ...and 4 more figures
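
The planner and reflector interaction described in the Figure 1 caption can be illustrated with a small sketch. This is not the authors' implementation: the function `retrospective_loop`, its signature, the `max_trials` cap, and the toy planner, reflector, and evaluator are all hypothetical stand-ins, assuming only what the caption states (the planner attempts the task, and on failure the reflector supplies feedback for the next attempt).

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces, assumed for illustration only.
Planner = Callable[[str, List[str]], str]   # (task, reflections so far) -> attempt
Reflector = Callable[[str, str], str]       # (task, failed attempt) -> reflection
Evaluator = Callable[[str, str], bool]      # (task, attempt) -> did it succeed?


def retrospective_loop(
    task: str,
    planner: Planner,
    reflector: Reflector,
    evaluator: Evaluator,
    max_trials: int = 3,
) -> Tuple[str, bool, int]:
    """Run plan -> evaluate -> reflect until success or the trial cap is hit.

    Returns (last attempt, success flag, number of trials used).
    """
    reflections: List[str] = []
    attempt = ""
    for trial in range(1, max_trials + 1):
        # The planner sees the task plus all reflections gathered so far.
        attempt = planner(task, reflections)
        if evaluator(task, attempt):
            return attempt, True, trial
        # On failure, the reflector proposes an adjustment for the next trial.
        reflections.append(reflector(task, attempt))
    return attempt, False, max_trials


# Toy stand-ins (not LLM calls): the planner only succeeds after one reflection.
def toy_planner(task: str, reflections: List[str]) -> str:
    return "4" if reflections else "5"


def toy_reflector(task: str, failed_attempt: str) -> str:
    return f"'{failed_attempt}' was wrong for '{task}'; recheck the arithmetic."


def toy_evaluator(task: str, attempt: str) -> bool:
    return attempt == "4"


if __name__ == "__main__":
    print(retrospective_loop("2+2", toy_planner, toy_reflector, toy_evaluator))
```

In a real agent the three callables would wrap model calls and environment feedback; the sketch only shows the control flow, and says nothing about how RetroAct trains either component.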