EpMAN: Episodic Memory AttentioN for Generalizing to Longer Contexts

Subhajit Chaudhury, Payel Das, Sarathkrishna Swaminathan, Georgios Kollias, Elliot Nelson, Khushbu Pahwa, Tejaswini Pedapati, Igor Melnyk, Matthew Riemer

TL;DR

EpMAN tackles long-context processing in LLMs by introducing episodic memory attention, which reads from a memory of context chunks and reweights the decoder's self-attention via a differentiating attention mechanism. The memory read uses cosine similarity to produce a chunk-level episodic attention $a_{mem}$, which is combined with standard self-attention to yield $a_{epman}$, enabling robust chunk-wise relevance weighting. The model is trained on synthetic data with a denoising objective, and at inference it uses BroadAttn to expand the attended neighborhood context, yielding superior recall and LV-Eval QA performance across 16k–256k contexts. The results suggest EpMAN offers a scalable, robust alternative to purely self-attentive or RAG-based long-context strategies, with practical implications for memory-augmented LLMs.
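The reweighting described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `epman_weights`, the `top_k` cutoff, the clipping of negative similarities to zero, and the elementwise product without renormalization are all assumptions made for illustration. It shows one plausible way a chunk-level weight $a_{mem}$ could scale token-level attention to give $a_{epman}$.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def epman_weights(q_tok, keys, chunk_ids, q_emb, chunk_embs, top_k=2):
    """Hypothetical sketch: scale token attention by chunk relevance.

    q_tok      -- decoder query vector for the current token
    keys       -- cached key vectors, one per context token
    chunk_ids  -- index of the chunk each context token belongs to
    q_emb      -- query embedding used for the memory read
    chunk_embs -- one embedding per stored context chunk
    """
    # Chunk-level episodic attention from cosine similarity.
    # Assumption: negative similarities are clipped to zero.
    a_mem = [max(0.0, cosine(q_emb, c)) for c in chunk_embs]

    # Assumption: only the top_k most similar chunks keep their weight.
    ranked = sorted(range(len(a_mem)), key=lambda i: a_mem[i], reverse=True)
    keep = set(ranked[:top_k])
    a_mem = [w if i in keep else 0.0 for i, w in enumerate(a_mem)]

    # Standard scaled dot-product attention over all cached tokens.
    scale = math.sqrt(len(q_tok))
    logits = [sum(a * b for a, b in zip(q_tok, k)) / scale for k in keys]
    a_self = softmax(logits)

    # Each token's attention is scaled by the weight of its chunk
    # (assumption: plain product, no renormalization).
    a_epman = [w * a_mem[c] for w, c in zip(a_self, chunk_ids)]
    return a_mem, a_self, a_epman

# Three chunks of two tokens each; the query embedding matches chunk 0 best.
a_mem, a_self, a_epman = epman_weights(
    q_tok=[1.0, 1.0, 1.0, 1.0],
    keys=[[0.0] * 4 for _ in range(6)],
    chunk_ids=[0, 0, 1, 1, 2, 2],
    q_emb=[1.0, 0.0],
    chunk_embs=[[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]],
)
```

In this toy setup the token-level attention is uniform, so the final weights differ only through the chunk-level term: tokens in the best-matching chunk keep their full attention, tokens in the partially matching chunk are down-weighted, and tokens in the irrelevant chunk are zeroed out.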

Abstract

Recent advances in Large Language Models (LLMs) have yielded impressive successes on many language tasks. However, efficient processing of long contexts using LLMs remains a significant challenge. We introduce EpMAN, a method for processing long contexts in an episodic memory module while holistically attending to semantically relevant context chunks. The output of episodic attention is then used to reweigh the decoder's self-attention to the stored KV cache of the context during training and generation. When an LLM decoder is trained using EpMAN, its performance on multiple challenging single-hop long-context recall and question-answering benchmarks is found to be stronger and more robust across the range from 16k to 256k tokens than baseline decoders trained with self-attention, and popular retrieval-augmented generation frameworks.

Paper Structure

This paper contains 26 sections, 2 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: EpMAN uses episodic attention and noisy training for robust long-context performance on recall and QA tasks (mean over 16k–256k context lengths).
  • Figure 2: CFI and KPR in the LV-Eval dataset.
  • Figure 3: LLM-as-Judge prompt used to measure performance on MultiFieldQA and LoogleQA.