Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

Zeyuan Liu, Jeonghye Kim, Xufang Luo, Dongsheng Li, Yuqing Yang

TL;DR

In out-of-distribution tests, EMPO$^2$ demonstrates superior adaptability to new tasks, requiring only a few trials with memory and no parameter updates, highlighting EMPO$^2$ as a promising framework for building more exploratory and generalizable LLM-based agents.

Abstract

Exploration remains the key bottleneck for large language model agents trained with reinforcement learning. While prior methods exploit pretrained knowledge, they fail in environments requiring the discovery of novel states. We propose Exploratory Memory-Augmented On- and Off-Policy Optimization (EMPO$^2$), a hybrid RL framework that leverages memory for exploration and combines on- and off-policy updates to make LLMs perform well with memory while also ensuring robustness without it. On ScienceWorld and WebShop, EMPO$^2$ achieves 128.6% and 11.3% improvements over GRPO, respectively. Moreover, in out-of-distribution tests, EMPO$^2$ demonstrates superior adaptability to new tasks, requiring only a few trials with memory and no parameter updates. These results highlight EMPO$^2$ as a promising framework for building more exploratory and generalizable LLM-based agents.

Paper Structure (29 sections, 2 equations, 13 figures, 3 tables, 1 algorithm)

Figures (13)

  • Figure 1: (a) Comparison of the learning curves of GRPO and EMPO$^2$ (ours) on the ScienceWorld power-component task. While GRPO converges to suboptimal performance, EMPO$^2$ continues to improve and accomplishes the task. (b) Comparison of EMPO$^2$ and other baselines in in-distribution (ID) and out-of-distribution (OOD) settings on ScienceWorld and WebShop. In ID experiments, it adapts well to familiar environments, improving over GRPO by 128.6% on ScienceWorld and 11.3% on WebShop. In OOD experiments, it also shows strong performance with few trials and no weight updates, indicating effective use of memory to explore unfamiliar environments. Full results are in Tables \ref{table:scienceworld}, \ref{table:webshop}, and Figure \ref{fig:ood}.
  • Figure 2: Non-parametric updates can encourage exploration, bootstrapping parametric updates.
  • Figure 3: When an LLM is trained with GRPO in ScienceWorld, the agent struggles because of insufficient exploration. For instance, in the task “turn on the red light bulb,” the agent must first find the red light bulb before activating it. However, the agent fails to locate it and therefore cannot complete the task. Rather than analyzing the cause of failure and exploring alternative actions, the agent keeps repeating the same behavior, so its score stagnates even as training continues.
  • Figure 4: In EMPO$^2$, the current policy parameters $\pi_\theta$ are used to review past rollouts, with the resulting insights added to memory. This updated memory conditions subsequent rollouts and promotes exploration.
  • Figure 5: EMPO$^2$ mode combinations. Combining the two rollout modes with the two update modes yields three EMPO$^2$ mode configurations: on-policy learning without memory, on-policy learning with memory, and off-policy learning.
  • ...and 8 more figures
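
The mode combinations described in the Figure 5 caption can be sketched as a small enumeration. This is only an illustrative sketch, not the authors' implementation. The names `Mode` and `valid_modes` are hypothetical, and the reading of "off-policy" as "rollouts sampled with memory, update computed on a memory-free prompt" is an assumption that the captions above do not state explicitly.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Mode:
    """Hypothetical container for one rollout/update combination."""
    rollout_with_memory: bool  # was memory in the prompt when sampling?
    update_with_memory: bool   # is memory in the prompt when updating?

    @property
    def name(self) -> str:
        # Same prompt for sampling and updating: treated as on-policy.
        if self.rollout_with_memory == self.update_with_memory:
            suffix = "with memory" if self.rollout_with_memory else "without memory"
            return f"on-policy, {suffix}"
        # Assumption: sampled with memory, updated without it.
        return "off-policy"


def valid_modes() -> list[Mode]:
    """Enumerate the 2 x 2 grid and drop the one unused combination."""
    modes = []
    for rollout_mem, update_mem in product([False, True], repeat=2):
        # Assumption: memory-free rollouts are never paired with a
        # memory-conditioned update, which leaves three configurations.
        if not rollout_mem and update_mem:
            continue
        modes.append(Mode(rollout_mem, update_mem))
    return modes


if __name__ == "__main__":
    for mode in valid_modes():
        print(mode.name)
```

Under these assumptions, the grid of two rollout modes and two update modes reduces to the three configurations named in the caption, because one of the four pairings is excluded.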