Memory-Driven Self-Improvement for Decision Making with Large Language Models
Xue Yan, Zijing Ou, Mengyue Yang, Yan Song, Haifeng Zhang, Yingzhen Li, Jun Wang
TL;DR
This work tackles the challenge of adapting large language models (LLMs) to text-based sequential decision-making (SDM) tasks with limited task-specific data by introducing a memory-driven self-improvement framework. It combines memory-based, non-parametric value estimation (Mem-Q) with memory-guided refinement of the LLM prior through an Expectation-Maximization (EM) procedure (Mem-EM). Together these form a bootstrapping loop: memory improves the LLM prior, and the refined prior concentrates action search on high-quality candidates. The approach is evaluated on ALFWorld and Overcooked, improving performance by over 40% on in-distribution ALFWorld tasks and by over 75% on unseen tasks, while requiring only a few rounds of LLM fine-tuning. By leveraging retrieved, domain-specific experiences and EM-based prior refinement, the framework improves sample efficiency and generalization, offering a route to memory-informed, language-grounded SDM without extensive fine-tuning.
Abstract
Large language models (LLMs) have emerged as effective action policies for sequential decision-making (SDM) tasks due to their extensive prior knowledge. However, this broad yet general knowledge is often insufficient for specific decision-making tasks with limited task-related data, making it challenging to efficiently adapt LLMs to specific SDM tasks. To address this challenge, we propose a memory-driven self-improvement framework that combines the LLM's general prior knowledge with a compact memory of domain-specific experiences. The memory retains past interactions and their associated Q-values, thereby capturing decision-relevant knowledge that facilitates accurate value estimation and informs refinement of the LLM prior. The refined LLM prior, in turn, generates higher-reward trajectories that further enrich the memory, forming a natural self-improvement loop in which memory and LLM prior mutually reinforce each other. Experiments show that our memory-driven approach significantly outperforms both traditional RL and LLM-based baselines, e.g., improving performance by over 40% on in-distribution tasks and over 75% when generalized to unseen tasks in ALFWorld.
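To make the idea of memory-based, non-parametric value estimation concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a memory of (state, action, Q-value) records and estimates the Q-value of a new state-action pair as a similarity-weighted average over its k nearest stored records. The names `Memory`, `embed`, `estimate_q`, and `pick_action` are hypothetical, and the bag-of-words embedding is a toy stand-in for whatever text encoder a real system would use.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a real text encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Memory:
    """Hypothetical store of past (state, action) pairs with their Q-values."""

    def __init__(self):
        self.entries = []  # list of (embedding, q_value)

    def add(self, state, action, q_value):
        self.entries.append((embed(state + " " + action), q_value))

    def estimate_q(self, state, action, k=2, default=0.0):
        """Similarity-weighted average of the Q-values of the k nearest records."""
        if not self.entries:
            return default
        query = embed(state + " " + action)
        scored = sorted(
            ((cosine(query, emb), q) for emb, q in self.entries),
            key=lambda pair: pair[0],
            reverse=True,
        )[:k]
        total = sum(sim for sim, _ in scored)
        if total == 0:
            return default
        return sum(sim * q for sim, q in scored) / total


def pick_action(memory, state, candidates):
    """Choose the candidate action with the highest memory-based Q estimate."""
    return max(candidates, key=lambda a: memory.estimate_q(state, a))


if __name__ == "__main__":
    memory = Memory()
    memory.add("you are in the kitchen", "open fridge", 1.0)
    memory.add("you are in the kitchen", "go to bedroom", 0.0)
    # In the framework described above, candidates would come from the LLM prior.
    candidates = ["open fridge", "go to bedroom"]
    print(pick_action(memory, "you are in the kitchen", candidates))
```

In this sketch the candidate actions are hard-coded; in the loop described in the abstract they would be proposed by the LLM prior, and the trajectories the agent collects would be written back into memory with updated Q-values.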
