Extended Mind Transformers

Phoebe Klett, Thomas Ahle

TL;DR

Extended Mind Transformers address the long-context bottleneck in pre-trained language models by integrating an internal cache of external memories through top-k attention across decoder layers without fine-tuning. The method leverages appropriate position encodings to make retrieved memories usable within current attention, and demonstrates that retrieving information in most decoder layers yields strong performance. A new counterfactual long-range retrieval benchmark shows competitive results with state-of-the-art methods, including a 6% average improvement when combined with RAG, while maintaining favorable inference-time characteristics due to upfront memory generation. Additionally, the approach enables interpretable, token-level citations of retrieved memories and supports uncertainty-driven active learning to potentially reduce hallucinations.

Abstract

Pre-trained language models demonstrate general intelligence and common sense, but long inputs quickly become a bottleneck for memorizing information at inference time. We resurface a simple method, Memorizing Transformers (Wu et al., 2022), that gives the model access to a bank of pre-computed memories. We show that it is possible to fix many of the shortcomings of the original method, such as the need for fine-tuning, by critically assessing how positional encodings should be updated for the keys and values retrieved. This intuitive method uses the model's own key/query system to select and attend to the most relevant memories at each generation step, rather than using external embeddings. We demonstrate the importance of external information being retrieved in a majority of decoder layers, contrary to previous work. We open source a new counterfactual long-range retrieval benchmark, and show that Extended Mind Transformers outperform today's state of the art by 6% on average.
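The retrieval step described in the abstract can be illustrated with a minimal single-head sketch in NumPy. This is not the authors' implementation: the function name `topk_memory_attention`, its arguments, and the use of scaled dot-product scores for selecting memories are assumptions made for illustration. The position-encoding treatment of retrieved keys, which the abstract identifies as essential for avoiding fine-tuning, is deliberately left out. Each query picks its `top_k` highest-scoring memory keys, and those memories are attended to jointly with the causally masked local context in one softmax.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; -inf entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def topk_memory_attention(q, k_local, v_local, k_mem, v_mem, top_k):
    """Hypothetical sketch: one attention head over local context plus
    the top_k memories retrieved separately for each query.

    q:                (Q, HD)  queries for the current tokens
    k_local, v_local: (KV, HD) keys/values of the local context
    k_mem, v_mem:     (M, HD)  pre-computed memory keys/values, top_k <= M
    """
    n_q, hd = q.shape
    n_kv = k_local.shape[0]
    scale = 1.0 / np.sqrt(hd)

    # Score every query against every memory key (assumed: scaled dot product).
    mem_scores = (q @ k_mem.T) * scale                      # (Q, M)
    # Indices of the top_k best memories per query (unordered within the top_k).
    idx = np.argpartition(-mem_scores, top_k - 1, axis=-1)[:, :top_k]
    sel_scores = np.take_along_axis(mem_scores, idx, axis=-1)  # (Q, top_k)
    sel_values = v_mem[idx]                                 # (Q, top_k, HD)

    # Causal scores over the local context; queries align with the last Q keys.
    local_scores = (q @ k_local.T) * scale                  # (Q, KV)
    offset = n_kv - n_q
    future = np.arange(n_kv)[None, :] > (np.arange(n_q)[:, None] + offset)
    local_scores = np.where(future, -np.inf, local_scores)

    # One softmax over retrieved memories and local keys together.
    weights = softmax(np.concatenate([sel_scores, local_scores], axis=-1))
    w_mem, w_local = weights[:, :top_k], weights[:, top_k:]
    out = np.einsum("qk,qkd->qd", w_mem, sel_values) + w_local @ v_local
    return out, idx, weights


rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8))
k_local, v_local = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
k_mem, v_mem = rng.normal(size=(20, 8)), rng.normal(size=(20, 8))
out, idx, weights = topk_memory_attention(q, k_local, v_local, k_mem, v_mem, top_k=2)
```

Returning `idx` alongside the output keeps track of which memory positions each query attended to. That bookkeeping is one plausible basis for the token-level citations mentioned in the TL;DR.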

Paper Structure

This paper contains 27 sections, 1 equation, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Overview of attention over memories and local context, where Q is the length of the queries, KV is the length of the key-values, HD is the head dimension, and K is a memory hyper-parameter. Arrows show queries retrieving memories; gradient colors represent inner-product scores and the softmax.
  • Figure 2: Average perplexity on sequences of increasing input lengths. Results shown for baselines and Extended Mind Llama-2-7b.
  • Figure 3: Average perplexity on sequences of increasing input lengths. Increasing k corresponds to retrieving more key-value pairs.
  • Figure 4: Fact retrieval accuracy over various document lengths for Extended Mind Llama-2-70b, RAG, and state-of-the-art baselines.
  • Figure 5: Retrieval accuracy over various document lengths for Extended Mind Llama-2-7b and long-context baselines.
  • ...and 4 more figures