Needle in the Haystack for Memory Based Large Language Models
Elliot Nelson, Georgios Kollias, Payel Das, Subhajit Chaudhury, Soham Dan
TL;DR
The paper addresses the limitation of standard LLMs in recalling facts over long prompts by using Larimar, an LLM architecture with a dynamically updatable external associative memory. It details the memory-writing and memory-reading mechanism, in which keys are derived from a fixed key memory and the memory is stored on the CPU, so that context length can scale without consuming additional GPU memory. On two long-context tasks (passkey and needle-in-the-haystack), a 1.3B-parameter Larimar maintains strong recall at context lengths of 100K–1M tokens without task-specific training, matching or outperforming larger baselines in some settings. The work points to external memory as a scalable way to extend the long-context capabilities of smaller LLMs, with practical benefits for memory-intensive reasoning tasks.
Abstract
Current large language models (LLMs) often perform poorly on simple fact retrieval tasks. Here we investigate whether coupling a dynamically adaptable external memory to an LLM can alleviate this problem. For this purpose, we test Larimar, a recently proposed language model architecture that uses an external associative memory, on long-context recall tasks including passkey and needle-in-the-haystack tests. We demonstrate that the external memory of Larimar, which allows fast writing and reading of an episode of text samples, can be used at test time to handle contexts much longer than those seen during training. We further show that the latent readouts from the memory (to which long contexts are written) steer the decoder towards generating correct outputs, with the memory stored off the GPU. Compared to existing transformer-based LLM architectures for long-context recall tasks, which use larger parameter counts or modified attention mechanisms, a relatively small Larimar model is able to maintain strong performance without any task-specific training or training on longer contexts.
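To make the idea of writing an episode to an associative memory and reading it back concrete, the following is a minimal sketch of a generic linear key-value memory, not the Larimar implementation. It assumes random fixed keys, random vectors standing in for encoded text latents, and a least-squares write via the pseudoinverse; the function names `write_memory` and `read_memory` and all dimensions are hypothetical. Everything runs in NumPy on the CPU, loosely mirroring the idea of keeping the memory off the GPU.

```python
import numpy as np


def write_memory(keys, values):
    """Write: find M minimizing ||keys @ M - values|| (least squares)."""
    return np.linalg.pinv(keys) @ values


def read_memory(keys, memory):
    """Read: recover one latent per key as a linear readout of the memory."""
    return keys @ memory


rng = np.random.default_rng(0)
num_items, key_dim, latent_dim = 8, 32, 16  # illustrative sizes (assumption)

# Fixed keys, one per stored item (assumption: random Gaussian keys).
keys = rng.standard_normal((num_items, key_dim))
# Stand-ins for the latent encodings of text segments (assumption).
values = rng.standard_normal((num_items, latent_dim))

memory = write_memory(keys, values)    # shape (key_dim, latent_dim)
recalled = read_memory(keys, memory)   # shape (num_items, latent_dim)

# With fewer items than key dimensions, the keys are linearly independent
# and the stored latents are recovered up to floating-point error.
max_error = float(np.max(np.abs(recalled - values)))
```

In this toy setting the readout for a given key is the latent that was written with it; in a memory-augmented model such a readout would then condition the decoder, which is outside the scope of this sketch.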
