Extended Mind Transformers
Phoebe Klett, Thomas Ahle
TL;DR
Extended Mind Transformers address the long-context bottleneck in pre-trained language models by giving the model an internal cache of external memories, which it retrieves through top-k attention in its decoder layers without any fine-tuning. The method applies appropriate position encodings to the retrieved keys and values so that they are usable in the current attention computation, and shows that retrieving information in most decoder layers yields strong performance. On a new counterfactual long-range retrieval benchmark, the approach is competitive with state-of-the-art methods, including a 6% average improvement when combined with RAG, and it keeps favorable inference-time characteristics because memories are generated up front. The approach also enables interpretable, token-level citations of retrieved memories and supports uncertainty-driven active learning, which may reduce hallucinations.
Abstract
Pre-trained language models demonstrate general intelligence and common sense, but long inputs quickly become a bottleneck for memorizing information at inference time. We resurface a simple method, Memorizing Transformers (Wu et al., 2022), that gives the model access to a bank of pre-computed memories. We show that it is possible to fix many of the shortcomings of the original method, such as the need for fine-tuning, by critically assessing how positional encodings should be updated for the retrieved keys and values. This intuitive method uses the model's own key/query system to select and attend to the most relevant memories at each generation step, rather than using external embeddings. We demonstrate the importance of retrieving external information in a majority of decoder layers, contrary to previous work. We open-source a new counterfactual long-range retrieval benchmark, and show that Extended Mind Transformers outperform today's state of the art by 6% on average.
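To make the retrieval mechanism concrete, the following is a minimal single-head sketch in NumPy of attention over the local context plus the top-k cached memories selected with the model's own queries and keys. It is an illustration under stated assumptions, not the authors' implementation: the function name `topk_memory_attention`, its argument names, and the `top_k` default are hypothetical, and the sketch leaves out causal masking, multiple heads, and the position-encoding treatment of retrieved keys that the abstract identifies as critical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def topk_memory_attention(q, k_local, v_local, k_mem, v_mem, top_k=2):
    """Attend to the local context plus each query's top-k cached memories.

    Hypothetical helper for illustration only.
    q:                (n_q, d)   queries for the current tokens
    k_local, v_local: (n_ctx, d) keys/values of the local context
    k_mem, v_mem:     (n_mem, d) pre-computed memory keys/values
    Returns the attention output (n_q, d) and the retrieved memory
    indices (n_q, top_k).
    """
    scale = 1.0 / np.sqrt(q.shape[-1])

    # Score every memory with the model's own query/key dot product.
    mem_scores = (q @ k_mem.T) * scale                      # (n_q, n_mem)

    # Keep only the top_k highest-scoring memories per query token.
    mem_idx = np.argsort(-mem_scores, axis=-1)[:, :top_k]   # (n_q, top_k)
    top_scores = np.take_along_axis(mem_scores, mem_idx, axis=-1)
    top_values = v_mem[mem_idx]                             # (n_q, top_k, d)

    # Local scores; causal masking is omitted in this sketch.
    local_scores = (q @ k_local.T) * scale                  # (n_q, n_ctx)

    # One softmax over retrieved memories and local tokens together.
    weights = softmax(np.concatenate([top_scores, local_scores], axis=-1))
    w_mem, w_local = weights[:, :top_k], weights[:, top_k:]

    out = np.einsum("qk,qkd->qd", w_mem, top_values) + w_local @ v_local
    return out, mem_idx
```

In this sketch, the returned `mem_idx` records which memory tokens each query attended to, which is the kind of information a token-level citation could be built from. Setting `top_k` equal to the number of memories reduces the computation to ordinary attention over the concatenated memory and local keys.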
