EpMAN: Episodic Memory AttentioN for Generalizing to Longer Contexts
Subhajit Chaudhury, Payel Das, Sarathkrishna Swaminathan, Georgios Kollias, Elliot Nelson, Khushbu Pahwa, Tejaswini Pedapati, Igor Melnyk, Matthew Riemer
TL;DR
EpMAN tackles long-context processing in LLMs by introducing episodic memory attention, which reads from a memory of context chunks and reweights the decoder's self-attention via a differentiating attention mechanism. The memory read uses cosine similarity to produce an episodic attention $a_{mem}$, which is combined with standard self-attention to yield $a_{epman}$, enabling robust chunk-wise relevance weighting. The model is trained on synthetic data with a denoising objective, and at inference it applies BroadAttn to expand the neighborhood context, yielding stronger recall and LV-Eval QA performance across contexts of 16k to 256k tokens. The results suggest EpMAN offers a scalable, robust alternative to purely self-attentive or RAG-based long-context strategies, with practical implications for memory-augmented LLMs.
Abstract
Recent advances in Large Language Models (LLMs) have yielded impressive successes on many language tasks. However, efficient processing of long contexts using LLMs remains a significant challenge. We introduce EpMAN, a method for processing long contexts in an episodic memory module while holistically attending to semantically relevant context chunks. The output of episodic attention is then used to reweight the decoder's self-attention to the stored KV cache of the context during training and generation. When an LLM decoder is trained using EpMAN, its performance on multiple challenging single-hop long-context recall and question-answering benchmarks is found to be stronger and more robust across the range from 16k to 256k tokens than that of baseline decoders trained with self-attention and of popular retrieval-augmented generation frameworks.
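The reweighting step described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names, the clipping of negative similarities to zero, the element-wise product used to combine the two attentions, and the absence of renormalization are all assumptions made for illustration. The summary states only that a cosine-similarity memory read produces $a_{mem}$ and that it is combined with standard self-attention to give $a_{epman}$.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def episodic_attention(query_emb, chunk_embs):
    """One relevance weight per chunk (hypothetical helper).

    Assumption: negative similarities are clipped to 0 so the
    weights can act as multiplicative gates.
    """
    return [max(0.0, cosine(query_emb, c)) for c in chunk_embs]


def epman_attention(q, keys, chunk_ids, a_mem):
    """Reweight token-level self-attention by chunk-level relevance.

    Assumption: the combination is an element-wise product, where
    every cached key inherits the weight of the chunk it belongs to.
    No renormalization is applied in this sketch.
    """
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    a_self = softmax(scores)
    return [a * a_mem[c] for a, c in zip(a_self, chunk_ids)]


# Toy example: two chunks of two cached keys each; the query
# embedding matches chunk 0 and is orthogonal to chunk 1.
a_mem = episodic_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
keys = [[0.5, 0.5]] * 4
a_epman = epman_attention([1.0, 1.0], keys, [0, 0, 1, 1], a_mem)
print(a_mem, a_epman)
```

In this toy case the plain self-attention is uniform over the four cached keys, so the chunk-level weights alone decide which tokens keep their attention mass: keys in the irrelevant chunk are suppressed regardless of their token-level scores.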
