Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

Zeyuan Liu; Jeonghye Kim; Xufang Luo; Dongsheng Li; Yuqing Yang

TL;DR

In out-of-distribution tests, EMPO$^2$ demonstrates superior adaptability to new tasks, requiring only a few trials with memory and no parameter updates, highlighting EMPO$^2$ as a promising framework for building more exploratory and generalizable LLM-based agents.

Abstract

Exploration remains the key bottleneck for large language model agents trained with reinforcement learning. While prior methods exploit pretrained knowledge, they fail in environments requiring the discovery of novel states. We propose Exploratory Memory-Augmented On- and Off-Policy Optimization (EMPO$^2$), a hybrid RL framework that leverages memory for exploration and combines on- and off-policy updates to make LLMs perform well with memory while also ensuring robustness without it. On ScienceWorld and WebShop, EMPO$^2$ achieves 128.6% and 11.3% improvements over GRPO, respectively. Moreover, in out-of-distribution tests, EMPO$^2$ demonstrates superior adaptability to new tasks, requiring only a few trials with memory and no parameter updates. These results highlight EMPO$^2$ as a promising framework for building more exploratory and generalizable LLM-based agents.

Paper Structure (29 sections, 2 equations, 13 figures, 3 tables, 1 algorithm)

Introduction
Preliminaries
The Exploration Problem of LLM Agents
Method
Advancing Exploration with Self-Generated Memory
Parameterize non-parametric updates via hybrid policy optimization
Related Work
Experiments
ScienceWorld
WebShop
Ablation study on Mode Combinations
Conclusion
Pseudo Code
Prompts
Detailed Explanation of Importance Sampling Ratios in Policy Updates
...and 14 more sections

Figures (13)

Figure 1: (a) Comparison of the learning curves of GRPO and EMPO$^2$ (ours) on the ScienceWorld power-component task. While GRPO converges to suboptimal performance, EMPO$^2$ continues to improve and accomplishes the task. (b) Comparison of EMPO$^2$ and other baselines in in-distribution (ID) and out-of-distribution (OOD) settings on ScienceWorld and WebShop. In ID experiments, it adapts well to familiar environments, improving over GRPO by 128.6% on ScienceWorld and 11.3% on WebShop. In OOD experiments, it also shows strong performance with few trials and no weight updates, indicating effective use of memory to explore unfamiliar environments. Full results are in Tables \ref{table:scienceworld}, \ref{table:webshop}, and Figure \ref{fig:ood}.
Figure 2: Non-parametric updates can encourage exploration, bootstrapping parametric updates.
Figure 3: When an LLM is trained with GRPO in ScienceWorld, the agent struggles because of insufficient exploration. For instance, in the task “turn on the red light bulb,” the agent must first find the red light bulb before activating it. However, the agent fails to locate it and therefore cannot complete the task. Rather than analyzing the cause of failure and exploring alternative actions, the agent repeats the same behavior, so its score stagnates even as training continues.
Figure 4: In EMPO$^2$, the current policy $\pi_\theta$ is used to review past rollouts, and the resulting insights are added to memory. This updated memory conditions subsequent rollouts and promotes exploration.
Figure 5: EMPO$^2$ mode combinations. Combining the two rollout modes with the two update modes yields three EMPO$^2$ configurations: on-policy learning without memory, on-policy learning with memory, and off-policy learning.
...and 8 more figures
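
The mode combinations in the Figure 5 caption can be illustrated with a small sketch. It is one plausible reading of the caption, not the authors' implementation: it assumes the two rollout modes are "with memory" and "without memory", that an update is on-policy when the update prompt matches the rollout prompt, and that the off-policy case is a memory-conditioned rollout used to update the policy without memory in the prompt. The function name `classify_mode` and the exclusion of the fourth combination are assumptions made for illustration.

```python
from itertools import product
from typing import Optional


def classify_mode(rollout_uses_memory: bool, update_uses_memory: bool) -> Optional[str]:
    """Map a (rollout mode, update mode) pair to a configuration name.

    Assumption: an update is on-policy when the prompt used for the update
    matches the prompt used for the rollout, and off-policy when a rollout
    collected with memory is used to update the policy without memory.
    """
    if rollout_uses_memory == update_uses_memory:
        if rollout_uses_memory:
            return "on-policy learning with memory"
        return "on-policy learning without memory"
    if rollout_uses_memory and not update_uses_memory:
        return "off-policy learning"
    # Assumption: rolling out without memory but updating with memory
    # is not one of the three configurations named in the caption.
    return None


# Enumerate all four pairs and keep the ones that map to a named configuration.
configs = [
    name
    for rollout, update in product([False, True], repeat=2)
    if (name := classify_mode(rollout, update)) is not None
]

for name in configs:
    print(name)
```

Under these assumptions, the four possible pairs collapse to the three configurations listed in the caption, which matches the abstract's stated goal of performing well with memory while remaining robust without it.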
