LMEB: Long-horizon Memory Embedding Benchmark

Xinping Zhao, Xinshuo Hu, Jiaxin Xu, Danyu Tang, Xin Zhang, Mengjia Zhou, Yan Zhong, Yao Zhou, Zifei Shan, Meishan Zhang, Baotian Hu, Min Zhang

Abstract

Memory embeddings are crucial for memory-augmented systems, such as OpenClaw, but their evaluation is underexplored in current text embedding benchmarks, which narrowly focus on traditional passage retrieval and fail to assess models' ability to handle long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information. To address this, we introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework that evaluates embedding models' capabilities in handling complex, long-horizon memory retrieval tasks. LMEB spans 22 datasets and 193 zero-shot retrieval tasks across 4 memory types: episodic, dialogue, semantic, and procedural, with both AI-generated and human-annotated data. These memory types differ in terms of level of abstraction and temporal dependency, capturing distinct aspects of memory retrieval that reflect the diverse challenges of the real world. We evaluate 15 widely used embedding models, ranging from hundreds of millions to ten billion parameters. The results reveal that (1) LMEB provides a reasonable level of difficulty; (2) Larger models do not always perform better; (3) LMEB and MTEB exhibit orthogonality. This suggests that the field has yet to converge on a universal model capable of excelling across all memory retrieval tasks, and that performance in traditional passage retrieval may not generalize to long-horizon memory retrieval. In summary, by providing a standardized and reproducible evaluation framework, LMEB fills a crucial gap in memory embedding evaluation, driving further advancements in text embedding for handling long-term, context-dependent memory retrieval. LMEB is available at https://github.com/KaLM-Embedding/LMEB.

Paper Structure (24 sections, 2 equations, 9 figures, 23 tables)

Figures (9)

  • Figure 1: Overview of LMEB memory categories and specificities. Table \ref{tab:dataset_stats} presents detailed dataset statistics. Tables \ref{tab:episodic_example}, \ref{tab:dialogue_example}, \ref{tab:semantic_example}, and \ref{tab:procedural_example} provide examples of query-relevant document pairs for each dataset. Tables \ref{tab:episodic_tasks}, \ref{tab:dialogue_tasks}, \ref{tab:semantic_tasks}, and \ref{tab:procedural_tasks} provide detailed task types and example abilities assessed.
  • Figure 2: Memory taxonomy of LMEB.
  • Figure 3: Inter-dataset diversity in LMEB. The left side illustrates pairwise weighted Jaccard Similarity (JS) scores between unigram word distributions of each dataset corpus, while the right side shows dataset relationships with a force-directed 2D layout.
  • Figure 4: Performance comparison between the w/o inst. and w/ inst. settings. The x-axis shows the two conditions (w/o inst. and w/ inst.), and the y-axis shows N@10 performance.
  • Figure 5: Correlation between the evaluation scores on LMEB and on the retrieval subset of MTEB (eng, v2) (Enevoldsen et al., ICLR 2025). LMEB scores use the N@10 metric, matching the MTEB (eng, v2) retrieval subset. Mean (Dataset) scores under w/ inst. are used for LMEB, except for bge-m3 (Dense), bge-large-en-v1.5, and EmbeddingGemma-300M, which perform better without task instructions and are therefore reported under the w/o inst. setting. Point size is proportional to model size.
  • ...and 4 more figures
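
The Figure 3 caption refers to pairwise weighted Jaccard similarity between the unigram word distributions of dataset corpora. A minimal sketch of that measure is given below, assuming the common definition (sum of element-wise minima divided by sum of element-wise maxima over the joint vocabulary). The tokenizer, the helper names `unigram_distribution` and `weighted_jaccard`, and the toy corpora are illustrative assumptions and are not taken from the paper or its repository.

```python
from collections import Counter
import re


def unigram_distribution(docs):
    """Return normalized unigram frequencies for a list of texts."""
    counts = Counter()
    for doc in docs:
        # Assumption: lowercase alphanumeric tokens; the paper's tokenizer may differ.
        counts.update(re.findall(r"[a-z0-9]+", doc.lower()))
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()} if total else {}


def weighted_jaccard(p, q):
    """Weighted Jaccard similarity: sum of minima over sum of maxima."""
    vocab = set(p) | set(q)
    numerator = sum(min(p.get(w, 0.0), q.get(w, 0.0)) for w in vocab)
    denominator = sum(max(p.get(w, 0.0), q.get(w, 0.0)) for w in vocab)
    return numerator / denominator if denominator else 0.0


# Toy corpora standing in for two datasets (hypothetical text).
corpus_a = ["the user asked about the meeting", "the meeting was moved to friday"]
corpus_b = ["open the settings menu", "click the save button"]

p = unigram_distribution(corpus_a)
q = unigram_distribution(corpus_b)

score_ab = weighted_jaccard(p, q)  # low: the corpora share only "the"
score_aa = weighted_jaccard(p, p)  # identical distributions
print(f"a vs b: {score_ab:.3f}, a vs a: {score_aa:.3f}")
```

Computing this score for every pair of datasets would yield a symmetric matrix of the kind the caption describes; lower off-diagonal values indicate corpora with less vocabulary overlap.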