Human-inspired Episodic Memory for Infinite Context LLMs

Zafeirios Fountas, Martin A Benfeghoul, Adnan Oomerjee, Fenia Christopoulou, Gerasimos Lampouras, Haitham Bou-Ammar, Jun Wang

TL;DR

This work addresses the difficulty of maintaining coherence over extremely long contexts in LLMs by introducing EM-LLM, a memory-augmented framework inspired by human episodic memory and event cognition. It forms memories via surprise-driven event boundaries, refines these boundaries with graph-theoretic metrics, and retrieves memories through a two-stage process that combines similarity and temporal contiguity, enabling layer-wise access without fine-tuning. Empirically, EM-LLM achieves state-of-the-art long-context performance on LongBench and ∞-Bench, surpasses RAG and full-context baselines on most tasks, and can retrieve from contexts up to $10^7$ tokens, demonstrating practically infinite context handling. The work also shows correlations between EM-LLM’s event segmentation and human-perceived events, suggesting cognitive parallels and offering a computational framework for studying human memory mechanisms alongside practical gains in AI systems.

Abstract

Large language models (LLMs) have shown remarkable capabilities, but still struggle with processing extensive contexts, limiting their ability to maintain coherence and accuracy over long sequences. In contrast, the human brain excels at organising and retrieving episodic experiences across vast temporal scales, spanning a lifetime. In this work, we introduce EM-LLM, a novel approach that integrates key aspects of human episodic memory and event cognition into LLMs with no fine-tuning, enabling them to handle practically infinite context lengths while maintaining computational efficiency. EM-LLM organises sequences of tokens into coherent episodic events using a combination of Bayesian surprise and graph-theoretic boundary refinement in an online fashion. When needed, these events are retrieved through a two-stage memory process, combining similarity-based and temporally contiguous retrieval for efficient, human-inspired access to relevant information. Experiments on the LongBench and $\infty$-Bench benchmarks demonstrate EM-LLM's superior performance, consistently outperforming the state-of-the-art retrieval model InfLLM across various baseline LLMs. In addition, EM-LLM outperforms its popular counterpart, RAG, in a wide range of tasks, while requiring similar resources. Notably, EM-LLM's performance even surpasses full-context models in most tasks, while successfully performing retrieval across 10 million tokens -- a scale computationally infeasible for such models. Finally, our analysis reveals strong correlations between EM-LLM's event segmentation and human-perceived events, suggesting parallels between this artificial system and its biological counterpart, thereby offering a novel computational framework for exploring human memory mechanisms.
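
The surprise-driven segmentation step described above can be illustrated with a small sketch. This is not the authors' implementation. It assumes that surprise is the negative log-probability the model assigned to each observed token, and that a boundary is placed when surprise exceeds the mean plus `gamma` standard deviations of a sliding window of preceding surprises. The names `segment_by_surprise`, `window`, and `gamma` are hypothetical, and the graph-theoretic boundary refinement is omitted.

```python
import math
from statistics import mean, pstdev


def surprise(token_probs):
    """Negative log-probability of each observed token (assumed surprise measure)."""
    return [-math.log(p) for p in token_probs]


def segment_by_surprise(token_probs, window=4, gamma=1.0):
    """Return the indices at which a new event starts.

    Assumption: token t opens a new event when its surprise exceeds
    mean + gamma * std of the surprises in the preceding window.
    """
    s = surprise(token_probs)
    boundaries = [0]  # the first token always opens an event
    for t in range(1, len(s)):
        recent = s[max(0, t - window):t]
        if len(recent) < 2:
            continue  # too little history to estimate a threshold
        threshold = mean(recent) + gamma * pstdev(recent)
        if s[t] > threshold:
            boundaries.append(t)
    return boundaries


def to_events(tokens, boundaries):
    """Split the token sequence into events at the given start indices."""
    ends = boundaries[1:] + [len(tokens)]
    return [tokens[b:e] for b, e in zip(boundaries, ends)]


if __name__ == "__main__":
    # Toy probabilities: two tokens (index 3 and 7) are highly unexpected.
    probs = [0.9, 0.8, 0.9, 0.05, 0.9, 0.85, 0.9, 0.02, 0.9]
    starts = segment_by_surprise(probs)
    print(starts)
    print(to_events(list("abcdefghi"), starts))
```

In this toy input the two low-probability tokens open new events, so the sequence is split into three segments. A real system would take the probabilities from the LLM's own next-token distribution rather than from a hand-written list.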

Paper Structure

This paper contains 56 sections, 11 equations, 14 figures, 12 tables, 1 algorithm.

Figures (14)

  • Figure 1: Top: EM-LLM$_S$ (surprise only) vs. RAG (NV-Embed-v2 retriever) vs. full-context, with LLaMA-3.1-8B as the base LLM, evaluated on LongBench. Bottom: Comparison of various long-sequence methods (sorted based on their context window length) on an extended version of $\infty$-Bench's Retrieve.PassKey. Baseline data taken from Ding:2024:LongRoPE.
  • Figure 2: Group-based $k$-NN retrieval can be seen as a form of hierarchical episodic attention. Initially, $k=4$ groups of tokens are selected (left) and then used for softmax attention (right), as if all other similarity scores were forced to be zero (non-shaded areas of the left curve). This framework can support multiple levels of episodic attention.
  • Figure 3: (A) Example of the temporal contiguity and asymmetry effect in human free recall. Data averaged over several large free recall studies (adapted from Howard:2002:TCM). (B) The attention scores of a GPT2 head averaged over all tokens tested (adapted from Jian:2024:CMRandLLMs). (C) Schematic illustrating our proposed process for memory formation and retrieval in each layer: Input sequence with surprise-based segmentation (purple arrows indicate high surprise). Formation of episodic memories: input is segmented into events and stored, with initial tokens and local context preserved. Note that the boundary refinement process is not shown here for clarity. Memory retrieval via $k$-NN search, selecting contiguous events from episodic memory. Final context window structure, comprising initial tokens, contiguity buffer (populated by neighbouring events), similarity buffer (from $k$-NN retrieval), and local context.
  • Figure 4: Comparison of human event segmentation with different computational segmentation methods in a human-annotated audio dataset (see also Appendix \ref{appdx:human_data}). (A) Difference in metrics for the cohesion and separation of KV cache of each LLaMA2 layer. The graphs report the difference of each method with the corresponding random segmentation. (B) Distance between human reports and different methods. In both sets of results, fixed methods (F, FM, FC; M: Modularity, C: Conductance) perform worse than their surprise-based counterparts (S, SM, SC), with InfLLM's method (F) performing worse than random.
  • Figure 5: The ratio of blocks retrieved by a layer which were not retrieved by any other layer for the same processed chunk, versus the total number of retrieved blocks by that layer. This is measured using EM-LLM$_S$ with Mistral-7B on a single example of $\infty$-Bench's Longbook.Choice.Eng task, with over 500 chunks of 512 tokens. In RAG methods, this ratio would always be zero, as retrieved blocks are used by all layers concurrently.
  • ...and 9 more figures
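
The two-stage retrieval outlined in the captions of Figures 2 and 3 (a similarity buffer filled by $k$-NN search over events, plus a contiguity buffer filled by neighbouring events, followed by softmax attention over what was selected) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: each event is assumed to be summarised by a single representative key vector, similarity is assumed to be a dot product, and the function names and the `n_neighbours` parameter are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_events(query, event_keys, k):
    """Stage 1: indices of the k events whose key best matches the query."""
    scores = [dot(query, key) for key in event_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def retrieve(query, event_keys, k=2, n_neighbours=1):
    """Return (similarity_buffer, contiguity_buffer) as sorted event indices.

    Stage 2 adds the temporal neighbours of each similarity hit that were
    not already selected (assumed symmetric neighbourhood).
    """
    hits = top_k_events(query, event_keys, k)
    neighbours = []
    for i in hits:
        for d in range(1, n_neighbours + 1):
            for j in (i - d, i + d):
                in_range = 0 <= j < len(event_keys)
                if in_range and j not in hits and j not in neighbours:
                    neighbours.append(j)
    return sorted(hits), sorted(neighbours)


def softmax_attention(query, keys, values):
    """Softmax attention restricted to the retrieved keys and values."""
    scores = [dot(query, key) for key in keys]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w / total * v[d] for w, v in zip(weights, values)) for d in range(dim)]


if __name__ == "__main__":
    event_keys = [[1, 0], [0, 1], [0.9, 0.1], [0, -1], [-1, 0]]
    similar, contiguous = retrieve([1, 0], event_keys, k=2)
    print(similar, contiguous)
    selected = sorted(similar + contiguous)
    keys = [event_keys[i] for i in selected]
    print(softmax_attention([1, 0], keys, keys))
```

Restricting the softmax to retrieved events behaves as if every unselected event were masked out of attention, which is the reading of Figure 2 this sketch follows. Because the search takes a query as input, running it separately per layer would let each layer retrieve different events, in line with the layer-wise access the summary mentions.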