
MemLong: Memory-Augmented Retrieval for Long Text Modeling

Weijie Liu, Zecheng Tang, Juntao Li, Kehai Chen, Min Zhang

TL;DR

MemLong tackles long-context language modeling by coupling a pre-trained decoder-only model with an external retriever and a memory bank that stores chunk-level K-V pairs and chunk embeddings. The lower layers stay frozen, while the fine-tuned upper layers use a retrieval causal attention mechanism to fuse retrieved memory with the local context, extending the context length from 4k up to 80k tokens on a single 3090 GPU. The approach yields consistent perplexity improvements on long-context language modeling benchmarks and improves in-context learning by keeping demonstrations in memory, with selective memory updates keeping the cost favorable. Because only part of the model is trained, MemLong offers a practical path to long-range document processing and much longer context windows in decoder-only LLMs without retraining the whole model.

Abstract

Recent advancements in Large Language Models (LLMs) have yielded remarkable success across diverse fields. However, handling long contexts remains a significant challenge for LLMs due to the quadratic time and space complexity of attention mechanisms and the growing memory consumption of the key-value cache during generation. This work introduces MemLong: Memory-Augmented Retrieval for Long Text Generation, a method designed to enhance the capabilities of long-context language modeling by utilizing an external retriever for historical information retrieval. MemLong combines a non-differentiable "ret-mem" module with a partially trainable decoder-only language model and introduces a fine-grained, controllable retrieval attention mechanism that leverages semantic-level relevant chunks. Comprehensive evaluations on multiple long-context language modeling benchmarks demonstrate that MemLong consistently outperforms other state-of-the-art LLMs. More importantly, MemLong can extend the context length on a single 3090 GPU from 4k up to 80k. Our code is available at https://github.com/Bui1dMySea/MemLong
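
To make the "ret-mem" idea concrete, the following is a minimal sketch of a chunk-level memory bank, not the authors' implementation. The class name `ChunkMemory`, the cosine-similarity ranking, the FIFO eviction, and the string placeholders standing in for K-V tensors are all assumptions made for illustration; the abstract only states that an external retriever fetches semantically relevant historical chunks.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class ChunkMemory:
    """Hypothetical memory bank: one embedding and one K-V payload per chunk."""
    capacity: int
    embeddings: list = field(default_factory=list)
    kv_pairs: list = field(default_factory=list)

    def add(self, embedding, kv):
        # Store the chunk, then drop the oldest entry if over capacity.
        # FIFO is a simple stand-in, not necessarily the paper's update policy.
        self.embeddings.append(embedding)
        self.kv_pairs.append(kv)
        if len(self.embeddings) > self.capacity:
            self.embeddings.pop(0)
            self.kv_pairs.pop(0)

    def retrieve(self, query, k):
        # Rank stored chunks by similarity to the query embedding and
        # return the K-V payloads of the top-k, most similar first.
        ranked = sorted(
            range(len(self.embeddings)),
            key=lambda i: cosine(query, self.embeddings[i]),
            reverse=True,
        )
        return [self.kv_pairs[i] for i in ranked[:k]]


memory = ChunkMemory(capacity=3)
memory.add([1.0, 0.0], "kv_chunk_0")   # placeholder strings stand in for K-V tensors
memory.add([0.0, 1.0], "kv_chunk_1")
memory.add([0.7, 0.7], "kv_chunk_2")
top = memory.retrieve([0.9, 0.1], k=2)
print(top)  # ['kv_chunk_0', 'kv_chunk_2']
```

The point of the sketch is the separation of roles: retrieval runs over compact chunk embeddings, while what is handed back to the model is the cached K-V payload rather than raw text.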

Paper Structure

This paper contains 32 sections, 4 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Illustration of Retrieval-Augmented Generation (RAG) and the memory-retrieval flow of MemLong. (a) RAG can even degrade generation performance (yellow) when the length of the retrieved information exceeds the model's processing capacity. (b) Our approach utilizes an external retriever to fetch historical information, which is then passed into the model as $\mathtt{K}\hbox{-}\mathtt{V}$ pairs rather than in text form.
  • Figure 2: An example of MemLong: in the lower layers, where the model remains static, causal language modeling is performed on the entire chunk $c_i$, and $c_i$ is subsequently cached in both embedding and $\mathtt{K}\hbox{-}\mathtt{V}$ pair forms. Lastly, the upper layers are fine-tuned to harmonize retrieval preferences and integrate the retrieved content.
  • Figure 3: Illustration of retrieval causal attention. Local causal attention is applied to the recent context, while chunk-level $\mathtt{K}\hbox{-}\mathtt{V}$ pairs, obtained through the retrieval method, enable bidirectional attention without information leakage due to their historical nature.
  • Figure 4: Perplexity (PPL) during the training phase; the y-axis shows PPL. The comparison focuses mainly on the trained parameters and the retrieval layers. The specific parameter settings of each line are given in \ref{sec:1}.
  • Figure 5: Evaluating different datasets at various memory sizes. In each subplot, all parameters are the same except for the memory size.
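
The attention pattern described in the Figure 3 caption can be sketched as a boolean mask. This is an illustrative reading of the caption, not code from the paper: the function name `retrieval_causal_mask` and the layout (retrieved memory keys placed before local keys) are assumptions. Each local query token sees every retrieved memory position, since those are historical, and sees local positions only up to its own index.

```python
def retrieval_causal_mask(num_memory, num_local):
    """Return a num_local x (num_memory + num_local) boolean mask.

    Row i is a local query token; True means that key position is visible.
    """
    mask = []
    for i in range(num_local):
        # Retrieved K-V positions are historical, so every query may see them.
        memory_part = [True] * num_memory
        # Local context stays causal: token i sees local tokens 0..i only.
        local_part = [j <= i for j in range(num_local)]
        mask.append(memory_part + local_part)
    return mask


mask = retrieval_causal_mask(num_memory=2, num_local=3)
for row in mask:
    print("".join("1" if visible else "0" for visible in row))
# 11100
# 11110
# 11111
```

The first two columns (memory) are fully visible in every row, while the last three columns (local context) form the usual lower-triangular causal pattern.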