How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior

Zidi Xiong, Yuping Lin, Wenya Xie, Pengfei He, Zirui Liu, Jiliang Tang, Himabindu Lakkaraju, Zhen Xiang

TL;DR

This work studies how episodic memory management—specifically memory addition and deletion—shapes the long-term behavior of LLM agents. It uncovers an experience-following property where high input similarity between current tasks and retrieved memories leads to similar outputs, while also highlighting error propagation and misaligned experience replay as key obstacles. The authors demonstrate that evaluator-driven memory management, including strict selective addition and history-based deletion, can improve stability and performance, even under task distribution shifts and memory constraints. The findings offer practical guidelines for designing memory banks that support robust, enduring agent competence across diverse tasks and environments.

Abstract

Memory is a critical component in large language model (LLM)-based agents, enabling them to store and retrieve past executions to improve task performance over time. In this paper, we conduct an empirical study of how memory management choices affect LLM agents' behavior, especially their long-term performance. Specifically, we focus on two fundamental memory management operations that are widely used by many agent frameworks (memory addition and deletion) and systematically study their impact on agent behavior. Through our quantitative analysis, we find that LLM agents display an experience-following property: high similarity between a task input and the input in a retrieved memory record often results in highly similar agent outputs. Our analysis further reveals two significant challenges associated with this property: error propagation, where inaccuracies in past experiences compound and degrade future performance, and misaligned experience replay, where some seemingly correct executions provide limited or even misleading value as experiences. Through controlled experiments, we demonstrate the importance of regulating experience quality within the memory bank and show that future task evaluations can serve as free quality labels for stored memory. Our findings offer insights into the behavioral dynamics of LLM agent memory systems and provide practical guidance for designing memory components that support robust, long-term agent performance.
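
The two operations named above, evaluator-gated addition and deletion driven by later task evaluations, can be illustrated with a small sketch. This is not the authors' implementation: the class and method names, the thresholds, and the token-level Jaccard retriever (a stand-in for whatever similarity measure a real agent uses) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class MemoryRecord:
    """One stored execution (hypothetical structure)."""
    task_input: str
    output: str
    # Evaluator scores of later tasks that retrieved this record.
    history: List[float] = field(default_factory=list)


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity; an assumed stand-in for a real retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


class MemoryBank:
    """Illustrative memory bank with selective addition and history-based deletion."""

    def __init__(
        self,
        evaluator: Callable[[str, str], float],
        add_threshold: float = 0.5,   # assumed value
        del_threshold: float = 0.5,   # assumed value
        min_uses: int = 2,            # assumed value
    ) -> None:
        self.evaluator = evaluator
        self.add_threshold = add_threshold
        self.del_threshold = del_threshold
        self.min_uses = min_uses
        self.records: List[MemoryRecord] = []

    def retrieve(self, task_input: str) -> Optional[MemoryRecord]:
        """Return the stored record whose input is most similar to the new task."""
        if not self.records:
            return None
        return max(self.records, key=lambda r: jaccard(task_input, r.task_input))

    def add(self, task_input: str, output: str) -> bool:
        """Selective addition: store an execution only if the evaluator accepts it."""
        if self.evaluator(task_input, output) < self.add_threshold:
            return False
        self.records.append(MemoryRecord(task_input, output))
        return True

    def record_feedback(self, record: MemoryRecord, score: float) -> None:
        """Log the evaluation of a later task that used this record as a demonstration."""
        record.history.append(score)

    def prune(self) -> int:
        """History-based deletion: drop records with enough uses and a low mean score."""
        keep = [
            r for r in self.records
            if len(r.history) < self.min_uses
            or sum(r.history) / len(r.history) >= self.del_threshold
        ]
        removed = len(self.records) - len(keep)
        self.records = keep
        return removed
```

In this sketch the evaluation of each later task doubles as a quality signal for the record it retrieved, which mirrors the abstract's point that future task evaluations can act as free quality labels for stored memory.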

Paper Structure

This paper contains 40 sections, 7 equations, 18 figures, 4 tables.

Figures (18)

  • Figure 1: Illustration of the memory management workflow after each agent execution.
  • Figure 2: Performance trend for EHRAgent and AgentDriver. 4o-mini, 4.1-mini, and 4.1-mini FT denote different coarse evaluators from the GPT series. Both the strict evaluator and some coarse evaluators exhibit consistent self-improvement over time.
  • Figure 3: Left: Output similarity versus input similarity for RegAgent over different evaluators. Right: Output similarity versus input similarity for AgentDriver over different evaluators.
  • Figure 4: Comparison of running performance between using the agent output as demonstrations and the error-free (EF) variant that uses ground truth. Coarse here uses the C1 evaluator.
  • Figure 5: Performance comparison after applying history-based deletion with different evaluators.
  • ...and 13 more figures