Evaluating Memory Structure in LLM Agents

Alina Shutova, Alexandra Olenina, Ivan Vinogradov, Anton Sinitsin

TL;DR

This paper addresses how LLM agents manage long-term memory beyond pure recall by introducing StructMemEval, a benchmark focused on memory-organization patterns such as trees, ledgers, and state-tracking. It contrasts retrieval-based baselines with memory-augmented approaches (Mem0, Mem-agent) and shows that explicit memory structuring, especially when prompted, significantly improves performance on tasks that require organizing knowledge. The work provides 73 evaluation scenarios with 544 questions across multiple memory patterns and demonstrates that modern LLMs do not consistently recognize the intended structure without hints, which points to a key avenue for training and memory-system design. Overall, StructMemEval offers a concrete, architecture-agnostic framework to assess and guide the development of memory hierarchies in LLM agents, with implications for more reliable long-term reasoning and user-data personalization.

Abstract

Modern LLM-based agents and chat assistants rely on long-term memory frameworks to store reusable knowledge, recall user preferences, and augment reasoning. As researchers create more complex memory architectures, it becomes increasingly difficult to analyze their capabilities and guide future memory designs. Most long-term memory benchmarks focus on simple fact retention, multi-hop recall, and time-based changes. While undoubtedly important, these capabilities can often be achieved with simple retrieval-augmented LLMs and do not test complex memory hierarchies. To bridge this gap, we propose StructMemEval, a benchmark that tests the agent's ability to organize its long-term memory, not just factual recall. We gather a suite of tasks that humans solve by organizing their knowledge in a specific structure: transaction ledgers, to-do lists, trees, and others. Our initial experiments show that simple retrieval-augmented LLMs struggle with these tasks, whereas memory agents can reliably solve them when prompted on how to organize their memory. However, we also find that modern LLMs do not always recognize the memory structure when not prompted to do so. This highlights an important direction for future improvements in both LLM training and memory frameworks.
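
To make the distinction between factual recall and memory organization concrete, the following is a minimal sketch of a ledger-style task. It is not the benchmark's actual task format, nor the API of Mem-agent or Mem0; the event list, the `LedgerMemory` class, and the `retrieval_answer` baseline are all hypothetical names invented for illustration. The sketch assumes that a question such as "how much does Bob owe Alice?" depends on every past transaction, so a memory that maintains running balances answers it directly, while a retriever that surfaces only a few recent messages may not.

```python
from collections import defaultdict

# Hypothetical event stream: (payer, beneficiary, amount).
# Names and amounts are invented for illustration only.
events = [
    ("alice", "bob", 30),   # alice paid 30 on behalf of bob
    ("bob", "alice", 10),
    ("alice", "bob", 5),
    ("carol", "alice", 20),
]


class LedgerMemory:
    """Stores a running net balance per pair instead of raw messages."""

    def __init__(self):
        self.balance = defaultdict(int)

    def record(self, payer, beneficiary, amount):
        # After this event the beneficiary owes the payer `amount` more.
        self.balance[(payer, beneficiary)] += amount
        self.balance[(beneficiary, payer)] -= amount

    def owes(self, debtor, creditor):
        # Net amount the debtor currently owes the creditor (never negative).
        return max(self.balance[(creditor, debtor)], 0)


def retrieval_answer(events, debtor, creditor, k=1):
    """Baseline sketch: use only the k most recent events between the pair."""
    relevant = [e for e in events if {e[0], e[1]} == {debtor, creditor}]
    recent = relevant[-k:]
    total = sum(amt if payer == creditor else -amt for payer, _, amt in recent)
    return max(total, 0)


memory = LedgerMemory()
for payer, beneficiary, amount in events:
    memory.record(payer, beneficiary, amount)

print(memory.owes("bob", "alice"))                    # structured memory
print(retrieval_answer(events, "bob", "alice", k=1))  # truncated retrieval
```

In this toy setup the ledger reports the full net balance, while the truncated retrieval baseline sees only the last transaction and reports a different figure. The same idea carries over to the other structures named above (to-do lists, trees): the answer depends on state accumulated across many messages rather than on any single retrievable fact.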

Paper Structure (11 sections, 1 figure, 5 tables)

Figures (1)

  • Figure 1: Top: accuracies across task types for the retrieval-augmented LLM, Mem-agent, and Mem0. Bottom: detailed accuracies per difficulty level (see the x-axis labels), using the same colors as above.