Memory-Driven Self-Improvement for Decision Making with Large Language Models

Xue Yan, Zijing Ou, Mengyue Yang, Yan Song, Haifeng Zhang, Yingzhen Li, Jun Wang

TL;DR

This work tackles the challenge of adapting large language models (LLMs) to text-based sequential decision-making (SDM) tasks with limited task-specific data by introducing a memory-driven self-improvement framework. It combines memory-based, non-parametric value estimation (Mem-Q) with memory-guided refinement of the LLM prior through an Expectation-Maximization (EM) procedure (Mem-EM), forming a bootstrapping loop in which memory improves the LLM prior and the refined prior concentrates action search on high-quality candidates. The approach is evaluated on ALFWorld and Overcooked; on ALFWorld it improves performance by over 40% on in-distribution tasks and by over 75% on unseen tasks, while requiring only a few rounds of LLM fine-tuning. By leveraging retrieved, domain-specific experiences and EM-based prior refinement, the framework improves sample efficiency and generalization, offering a route to memory-informed, language-grounded SDM without extensive fine-tuning.

Abstract

Large language models (LLMs) have emerged as effective action policies for sequential decision-making (SDM) tasks due to their extensive prior knowledge. However, this broad yet general knowledge is often insufficient for specific decision-making tasks with limited task-related data, making it challenging to efficiently adapt LLMs to specific SDM tasks. To address this challenge, we propose a memory-driven self-improvement framework that combines LLM general prior knowledge with a compact memory of domain-specific experiences. Memory retains past interactions and associated Q-values, thereby capturing decision-relevant knowledge that facilitates accurate value estimation and informs the LLM prior refinement. The refined LLM prior, in turn, generates higher-reward trajectories that further enrich memory, forming a natural self-improvement framework where memory and LLM prior mutually reinforce each other. Experiments show that our memory-driven approach significantly outperforms both traditional RL and LLM-based baselines, e.g., improving performance by over 40% on in-distribution tasks and over 75% when generalized to unseen tasks in ALFWorld.
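
To make the idea of memory-based, non-parametric value estimation concrete, here is a minimal sketch. It is not the paper's implementation: the cosine-similarity retrieval, the plain averaging over the top-k neighbours, the zero default for an empty memory, and the names `cosine`, `estimate_q`, and `pick_action` are all assumptions made for illustration. Each memory entry is assumed to pair an embedding of a $(s,a)$ pair with a stored Q-value.

```python
import math
from typing import List, Tuple

Vector = List[float]
MemoryEntry = Tuple[Vector, float]  # (embedding of an (s, a) pair, stored Q-value)


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def estimate_q(query: Vector, memory: List[MemoryEntry], k: int = 2) -> float:
    """Average the Q-values of the k stored entries most similar to `query`.

    Assumption: an unweighted mean over the top-k neighbours; the paper may
    use a different aggregation rule.
    """
    if not memory:
        return 0.0  # assumed default when nothing has been stored yet
    ranked = sorted(memory, key=lambda entry: cosine(query, entry[0]), reverse=True)
    top = ranked[:k]
    return sum(q for _, q in top) / len(top)


def pick_action(candidates: List[Tuple[str, Vector]],
                memory: List[MemoryEntry], k: int = 2) -> str:
    """Return the candidate action whose estimated Q-value is highest."""
    best_action, _ = max(candidates, key=lambda c: estimate_q(c[1], memory, k))
    return best_action


# Toy memory with hand-written 2-D embeddings (hypothetical values).
memory = [([1.0, 0.0], 1.0), ([0.9, 0.1], 0.5), ([0.0, 1.0], -1.0)]
candidates = [("go to fridge", [1.0, 0.0]), ("go to sink", [0.0, 1.0])]
print(estimate_q([1.0, 0.0], memory, k=2))
print(pick_action(candidates, memory, k=1))
```

In this toy setup the candidates stand in for the action proposals an LLM prior would generate, and the memory lookup plays the role of the value estimator that ranks them.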

Paper Structure

This paper contains 20 sections, 14 equations, 6 figures, 6 tables, 2 algorithms.

Figures (6)

  • Figure 1: Motivation and overview of our memory-driven self-improvement framework for text-based SDM. Left: existing approaches (prompt-engineering, fine-tuning, and RL with LLM priors) struggle under sparse signals and domain-specific data. Right: Our framework introduces two complementary roles: (1) memory-driven value estimation, which enables efficient exploration, and (2) LLM prior refinement, which biases action generation toward high-quality candidates; together forming a self-improvement loop that resists scarce experience and enables efficient adaptation.
  • Figure 2: Results of memory-driven Q-learning on Overcooked. Left: effect of the number of retrieved $(s,a)$ pairs for value estimation; Right: effect of different LLMs on representations.
  • Figure 3: Results of comparison with baselines. We plot the mean and standard error of the cumulative reward. The dashed line represents directly prompting the LLM prior to generate actions given state information, with the corresponding LLM version specified. The '$\times$' markers for $\text{Mem-EM}_\text{w/ tune}$ indicate the time steps when the LLM prior is fine-tuned.
  • Figure 4: Ablation study results. (a) Effect of the number of action candidates $K$ generated by the LLM. The '$\times$' markers indicate the time steps when the LLM prior is updated. (b) Impact of the LLM fine-tuning interval, where $n$ denotes that the LLM policy is fine-tuned every $n$ episodes. (c) Influence of the memory table capacity $N$, where at most $N$ $(s,a)$ pairs are stored, with the least-recently-used (LRU) strategy applied for replacement.
  • Figure 5: Ablation study on fine-tuning the BERT embedding model.
  • ...and 1 more figure
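
The capacity-bounded memory table mentioned in the Figure 4(c) caption (at most $N$ $(s,a)$ pairs, LRU replacement) can be sketched as follows. This is only an illustration under assumptions: the class name `LRUMemory`, its `get`/`put` interface, and the choice to treat both reads and writes as "use" are hypothetical and not taken from the paper.

```python
from collections import OrderedDict
from typing import Hashable, Optional


class LRUMemory:
    """Table of (state, action) -> Q-value holding at most `capacity` entries."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        # Insertion order doubles as recency order: oldest first, newest last.
        self._table: "OrderedDict[Hashable, float]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[float]:
        """Return the stored Q-value, marking the entry as recently used."""
        if key not in self._table:
            return None
        self._table.move_to_end(key)
        return self._table[key]

    def put(self, key: Hashable, q_value: float) -> None:
        """Insert or update an entry, evicting the least recently used one if full."""
        if key in self._table:
            self._table.move_to_end(key)
        self._table[key] = q_value
        if len(self._table) > self.capacity:
            self._table.popitem(last=False)  # drop the least recently used entry

    def __len__(self) -> int:
        return len(self._table)


# Usage with N = 2 and hypothetical (state, action) keys.
mem = LRUMemory(capacity=2)
mem.put(("kitchen", "open fridge"), 0.1)
mem.put(("kitchen", "go to sink"), 0.2)
mem.get(("kitchen", "open fridge"))      # touch: this entry is now most recent
mem.put(("kitchen", "take apple"), 0.3)  # evicts ("kitchen", "go to sink")
print(len(mem), mem.get(("kitchen", "go to sink")))
```

An `OrderedDict` keeps both lookup and eviction at constant time, which is why it is a common choice for this kind of bounded table.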