Large Language Models Are Semi-Parametric Reinforcement Learning Agents

Danyang Zhang, Lu Chen, Situo Zhang, Hongshen Xu, Zihan Zhao, Kai Yu

TL;DR

This work addresses the challenge of enabling LLM-based agents to learn from interaction experiences without fine-tuning the model. It introduces Reinforcement Learning with Experience Memory (RLEM) and the Rememberer architecture, which couples an LLM with a persistent external experience memory and updates that memory through RL signals. By retrieving past experiences as dynamic exemplars and providing action-advice that includes encouraged and discouraged options, Rememberer achieves state-of-the-art performance on WebShop and WikiHow benchmarks and demonstrates robustness across initial exemplars and training sets. The approach offers a practical, evolvable, semi-parametric alternative to fully parametric fine-tuning for sequential decision-making tasks. Overall, Rememberer shows how external, environment-grounded memory can empower LLMs to continually improve without altering their parameters.

Abstract

Inspired by insights from cognitive science into human memory and reasoning mechanisms, we propose REMEMBERER, a novel evolvable agent framework based on a Large Language Model (LLM). By equipping the LLM with a long-term experience memory, REMEMBERER can exploit experiences from past episodes, even those with different task goals, which gives it an advantage over LLM-based agents that rely on fixed exemplars or a transient working memory. We further introduce Reinforcement Learning with Experience Memory (RLEM) to update the memory. The whole system can thus learn from both successful and failed experiences and evolve its capability without fine-tuning the parameters of the LLM. In this way, REMEMBERER constitutes a semi-parametric RL agent. Extensive experiments are conducted on two RL task sets to evaluate the proposed framework. Averaged over different initializations and training sets, the results exceed the prior SOTA by 4% and 2% in success rate on the two task sets, demonstrating the superiority and robustness of REMEMBERER.
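The idea of an experience memory that is updated from rewards and then queried for advice can be illustrated with a small sketch. This is not the authors' implementation: the class and function names (`Record`, `ExperienceMemory`, `advice`), the tabular Q-learning-style update, the string-similarity retrieval, and the "value above zero means encouraged" rule are all assumptions chosen to keep the example self-contained.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Record:
    """Hypothetical memory entry: an observation under a task goal, with per-action value estimates."""
    task: str
    observation: str
    q_values: dict[str, float] = field(default_factory=dict)


class ExperienceMemory:
    """Persistent store of records, updated from rewards (assumed Q-learning-style rule)."""

    def __init__(self, learning_rate: float = 0.5, discount: float = 1.0):
        self.records: list[Record] = []
        self.learning_rate = learning_rate
        self.discount = discount

    def _find(self, task: str, observation: str):
        for record in self.records:
            if record.task == task and record.observation == observation:
                return record
        return None

    def update(self, task, observation, action, reward, next_value=0.0):
        """Move the stored estimate for (observation, action) toward reward + discount * next_value."""
        record = self._find(task, observation)
        if record is None:
            record = Record(task, observation)
            self.records.append(record)
        target = reward + self.discount * next_value
        old = record.q_values.get(action)
        if old is None:
            record.q_values[action] = target  # first visit: store the target directly
        else:
            record.q_values[action] = old + self.learning_rate * (target - old)

    def retrieve(self, task, observation, k=1):
        """Return the k most similar records; string similarity is a stand-in for a real retriever."""
        def similarity(record):
            return (SequenceMatcher(None, task, record.task).ratio()
                    + SequenceMatcher(None, observation, record.observation).ratio())
        return sorted(self.records, key=similarity, reverse=True)[:k]


def advice(record):
    """Split actions into encouraged and discouraged lists (threshold of 0 is an assumption)."""
    encouraged = [a for a, q in record.q_values.items() if q > 0]
    discouraged = [a for a, q in record.q_values.items() if q <= 0]
    return encouraged, discouraged


memory = ExperienceMemory()
memory.update("buy red shoes", "search page", "search[red shoes]", reward=1.0)
memory.update("buy red shoes", "search page", "click[back]", reward=0.0)

# A new episode with a different goal can still reuse the stored experience.
best = memory.retrieve("buy blue shoes", "search page", k=1)[0]
print(advice(best))
```

In a full agent, the retrieved records and their advice would be formatted into the LLM prompt as exemplars, so the LLM's parameters stay fixed while the memory changes across episodes.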

Paper Structure (31 sections, 9 equations, 11 figures, 13 tables)

This paper contains 31 sections, 9 equations, 11 figures, 13 tables.

Figures (11)

  • Figure 1: Comparison of LLM-based agents with short-term working memory and long-term experience memory. The working memory stores only the historical information of the current episode ($\mathcal{H}$), while the experience memory stores the interaction experiences ($\mathcal{E}$) permanently.
  • Figure 2: Pipeline of RLEM and architecture of Rememberer
  • Figure 3: An example of the records stored in the proposed experience memory.
  • Figure 4: An exemplar for the WebShop task set (Yao et al., 2022). The input part is depicted in the upper box and the output part is depicted in the lower box. Action candidates are advised along with their $Q$ value estimations and some optional extra information.
  • Figure 5: Example of the observation of WebShop
  • ...and 6 more figures