Locomo-Plus: Beyond-Factual Cognitive Memory Evaluation Framework for LLM Agents

Yifei Li, Weidong Guo, Lingling Zhang, Rongman Xu, Muye Huang, Hui Liu, Lijiao Xu, Yu Xu, Jun Liu

TL;DR

LoCoMo-Plus extends long-term conversational memory evaluation beyond explicit factual recall by grounding assessment in latent user constraints that must guide behavior under cue-trigger semantic disconnect. The authors propose a constraint-consistency evaluation framework and a benchmark construction pipeline built from implicit cue dialogues, verification, and insertion into long dialogues, with responses evaluated by LLM-based judges and human validation. Across diverse backbones, retrieval strategies, and memory systems, LoCoMo-Plus reveals a persistent performance gap on cognitive memory and highlights the biases of traditional task-disclosed prompting and surface-level metrics. The work provides a replicable, open-source evaluation framework and argues that benchmarks and evaluation protocols need to be rethought to drive progress in memory-enabled conversational AI.

Abstract

Long-term conversational memory is a core capability for LLM-based dialogue systems, yet existing benchmarks and evaluation protocols primarily focus on surface-level factual recall. In realistic interactions, appropriate responses often depend on implicit constraints such as user state, goals, or values that are not explicitly queried later. To evaluate this setting, we introduce LoCoMo-Plus, a benchmark for assessing cognitive memory under cue-trigger semantic disconnect, where models must retain and apply latent constraints across long conversational contexts. We further show that conventional string-matching metrics and explicit task-type prompting are misaligned with such scenarios, and propose a unified evaluation framework based on constraint consistency. Experiments across diverse backbone models, retrieval-based methods, and memory systems demonstrate that cognitive memory remains challenging and reveal failures not captured by existing benchmarks. Our code and evaluation framework are publicly available at: https://github.com/xjtuleeyf/Locomo-Plus.
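To make the claim about string-matching metrics concrete, the following is a minimal sketch, not the paper's evaluation code. It assumes a whitespace-tokenized token-level F1 as a stand-in for a conventional surface metric, and the dialogue scenario, reference answer, and both responses are hypothetical examples invented for illustration. It shows how a response that honors a latent constraint can score lower than one that violates it.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 over lowercased, whitespace-split tokens (an assumed surface metric)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical scenario: much earlier in the dialogue the user mentioned being
# vegetarian (the cue); later they ask for dinner ideas (the trigger), and the
# question never mentions diet.
reference = "suggest a meat-free dinner"
consistent = "how about mushroom risotto with roasted vegetables"  # respects the constraint
violating = "i suggest a steak dinner"  # ignores the constraint

consistent_score = token_f1(consistent, reference)
violating_score = token_f1(violating, reference)

print(f"consistent response F1: {consistent_score:.2f}")
print(f"violating response F1:  {violating_score:.2f}")
```

In this toy case the constraint-violating response shares more tokens with the reference than the constraint-consistent one, which is the kind of misalignment that motivates judging constraint consistency rather than lexical overlap.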

Paper Structure

This paper contains 42 sections, 1 equation, 7 figures, and 7 tables.

Figures (7)

  • Figure 1: Illustration of the gap between factual memory evaluation and the richer associative nature of biological memory, motivating the expansion toward beyond-factual cognitive memory benchmarks.
  • Figure 2: Cognitive memory in LoCoMo-Plus. Left: distribution of original LoCoMo question types and the additional cognitive memory QA instances introduced in LoCoMo-Plus. Right: cognitive memory is decomposed into four latent constraints—causal, state, goal, and value.
  • Figure 3: LoCoMo-Plus benchmark construction pipeline with aligned process and result layers.
  • Figure 4: Conceptual comparison between the existing evaluation framework and the proposed evaluation paradigm for conversational memory.
  • Figure 5: Comparison of task-disclosed and unified dialogue inputs across task types, evaluated with different output-side assessment methods and model families.
  • ...and 2 more figures