Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models

Jiaqi Cao, Jiarui Wang, Rubin Wei, Qipeng Guo, Kai Chen, Bowen Zhou, Zhouhan Lin

TL;DR

Memory Decoder (MemDec) is a plug-and-play pretrained memory that imitates non-parametric retrieval, enabling domain adaptation of large language models without modifying their parameters. A small transformer decoder is pretrained to align its output distribution with kNN distributions and is interpolated with the base model at inference time, delivering domain-specific knowledge with minimal added latency. Across biomedicine, finance, and law, MemDec improves perplexity and preserves zero-shot generalization while enabling cross-model and cross-vocabulary transfer. This modular approach reduces deployment cost and latency, offering a practical, scalable path to domain specialization for diverse LM architectures.

Abstract

Large Language Models (LLMs) have shown strong abilities in general language tasks, yet adapting them to specific domains remains a challenge. Current methods such as Domain Adaptive Pretraining (DAPT) require costly full-parameter training and suffer from catastrophic forgetting. Meanwhile, Retrieval-Augmented Generation (RAG) introduces substantial inference latency due to expensive nearest-neighbor searches and longer contexts. This paper introduces Memory Decoder, a plug-and-play pretrained memory that enables efficient domain adaptation without changing the original model's parameters. Memory Decoder employs a small transformer decoder that learns to imitate the behavior of an external non-parametric retriever. Once trained, Memory Decoder can be seamlessly integrated with any pretrained language model that shares the same tokenizer, requiring no model-specific modifications. Experimental results demonstrate that Memory Decoder enables effective adaptation of various Qwen and Llama models to three distinct specialized domains (biomedicine, finance, and law), reducing perplexity by an average of 6.17 points. Overall, Memory Decoder introduces a novel paradigm centered on a specially pretrained memory component designed for domain-specific adaptation. This memory architecture can be integrated in a plug-and-play manner, consistently enhancing performance across multiple models within the target domain.
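
The interpolation step mentioned above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the mixing weight `lam`, and the toy logits are assumptions introduced here, and the only property taken from the text is that both models share a tokenizer, so their next-token distributions range over the same vocabulary and can be mixed directly.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def interpolate_next_token(base_logits, memory_logits, lam=0.5):
    """Mix two next-token distributions: lam * p_mem + (1 - lam) * p_base.

    `lam` is a hypothetical mixing weight; the abstract does not state its value.
    """
    if base_logits.shape != memory_logits.shape:
        # A shared tokenizer implies identical vocabulary sizes.
        raise ValueError("base and memory logits must have the same shape")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    p_base = softmax(base_logits)
    p_mem = softmax(memory_logits)
    return lam * p_mem + (1.0 - lam) * p_base


# Toy logits over a 4-token vocabulary (made-up numbers, for illustration only).
base_logits = np.array([2.0, 1.0, 0.1, -1.0])
memory_logits = np.array([0.0, 3.0, 0.5, -2.0])

p_mixed = interpolate_next_token(base_logits, memory_logits, lam=0.3)
print(p_mixed, p_mixed.sum())
```

Because the mix is a convex combination of two probability vectors, the result is again a valid distribution, and neither model's parameters are touched. Setting `lam=0` recovers the base model unchanged.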

Paper Structure

This paper contains 45 sections, 11 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Comparison of domain adaptation approaches. DAPT (left) requires separate pre-training for each model size, modifying original parameters. RAG (middle) maintains model parameters but requires expensive retrieval from external datastores during inference. Memory Decoder (right) offers a plug-and-play solution where a single pretrained memory component can be interpolated with models of different sizes, avoiding both parameter modification and retrieval overhead.
  • Figure 2: Perplexity comparison on the finance domain of Qwen2.5 models augmented by Memory Decoder versus a LoRA adapter with the same parameter count.
  • Figure 3: Overview of Memory Decoder architecture. Upper (pre-training): During pre-training, Memory Decoder learns to align its output distributions with those generated by non-parametric retrievers through a distribution alignment loss. Lower (inference): During inference, Memory Decoder processes input in parallel with the base LLM, and their distributions are interpolated to produce domain-enhanced predictions without retrieval overhead.
  • Figure 4: Inference latency comparison across domain adaptation methods. These measurements were conducted on Qwen2.5-1.5B [yang2024qwen2] for biomedicine domain text, augmented by a 0.5B Memory Decoder.
  • Figure 5: Probability distributions from $k$-NN retrieval, standard LM, and Memory Decoder for GPT-2-Large. The $k$-NN distribution shows extreme sparsity with concentrated probability mass.
  • ...and 1 more figure
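
The pre-training signal described in the Figure 3 and Figure 5 captions can be illustrated with a small sketch. It rests on several assumptions that the captions do not confirm: the kNN distribution is built with the common softmax-over-negative-distance weighting, the alignment loss is taken to be a KL divergence from the kNN target to the model output, and the neighbor tokens, distances, and `temperature` are made-up values. All function names are hypothetical.

```python
import numpy as np


def knn_distribution(neighbor_tokens, neighbor_dists, vocab_size, temperature=1.0):
    """Turn retrieved neighbors into a next-token distribution.

    Each neighbor votes for its stored next token with weight exp(-distance / T).
    Tokens that no neighbor votes for get probability exactly zero (sparsity).
    """
    dists = np.asarray(neighbor_dists, dtype=float)
    # Subtracting the minimum distance avoids underflow and cancels on normalization.
    weights = np.exp(-(dists - dists.min()) / temperature)
    probs = np.zeros(vocab_size)
    np.add.at(probs, np.asarray(neighbor_tokens), weights)  # accumulates repeated tokens
    return probs / probs.sum()


def kl_divergence(p_target, q_model, eps=1e-12):
    """KL(p_target || q_model), summed over the support of the sparse target."""
    mask = p_target > 0
    return float(np.sum(p_target[mask] * (np.log(p_target[mask]) - np.log(q_model[mask] + eps))))


# Toy example: 5 retrieved neighbors over a 6-token vocabulary.
p_knn = knn_distribution(
    neighbor_tokens=[2, 2, 4, 2, 4],
    neighbor_dists=[0.1, 0.3, 0.9, 0.2, 1.5],
    vocab_size=6,
)
q_uniform = np.full(6, 1.0 / 6)  # stand-in for an untrained model's output

print(p_knn)
print(kl_divergence(p_knn, q_uniform))
```

In this toy case only tokens 2 and 4 receive any mass, which mirrors the sparsity noted for the $k$-NN distribution in Figure 5. Minimizing such a loss over a domain corpus would push the small decoder's output toward the retriever's distribution, so that no datastore lookup is needed at inference time.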