Memory Mosaics
Jianyu Zhang, Niklas Nolte, Ranajoy Sadhukhan, Beidi Chen, Léon Bottou
TL;DR
Memory Mosaics address interpretability and compositional learning in sequence modeling by combining multiple associative memories, each of which retrieves stored values via Gaussian kernel smoothing. The approach reframes attention as kernel-based retrieval and introduces predictive disentanglement, a training-time decomposition that assigns sub-tasks to individual memories. The paper shows that Memory Mosaics match the in-distribution (i.i.d.) performance of decoding transformers on language modeling and can outperform them on out-of-distribution tasks such as in-context learning, with demonstrations on a toy three-moons problem and on medium-scale language modeling. The work suggests a principled, more interpretable alternative to fully attention-based models and points to memory-based architectures as a promising path toward modular, transparent sequence learning.
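To make "retrieval via Gaussian kernel smoothing" concrete, here is a minimal sketch of a single associative memory, written as a standard kernel-weighted average over stored key/value pairs. The function names, the bandwidth parameter `beta`, and the toy keys and values are illustrative assumptions, not the paper's implementation. The second function illustrates the link to attention: when all stored keys have the same norm, the Gaussian weights coincide with softmax weights over scaled dot products.

```python
import math

def gaussian_kernel_retrieve(query, keys, values, beta=1.0):
    """Return a Gaussian-weighted average of stored values (kernel smoothing).

    `beta` is an assumed bandwidth parameter: larger values give sharper retrieval.
    """
    # Squared Euclidean distance from the query to every stored key.
    sq_dists = [sum((q - k) ** 2 for q, k in zip(query, key)) for key in keys]
    # Shift by the smallest distance for numerical stability;
    # the shift cancels once the weights are normalized.
    d_min = min(sq_dists)
    weights = [math.exp(-beta * (d - d_min)) for d in sq_dists]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def softmax_attention(query, keys, values, beta=1.0):
    """Softmax over scaled dot products, for comparison with the kernel form."""
    # exp(-beta * ||q - k||^2) is proportional to exp(2 * beta * q.k)
    # whenever every key has the same norm, hence the factor 2 * beta.
    scores = [2.0 * beta * sum(q * k for q, k in zip(query, key)) for key in keys]
    s_max = max(scores)
    weights = [math.exp(s - s_max) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

# Hypothetical toy memory: three unit-norm keys, each with a scalar value.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0]]

# Large beta: retrieval is dominated by the nearest key.
sharp = gaussian_kernel_retrieve([1.0, 0.0], keys, values, beta=50.0)
# beta = 0: every key gets equal weight, so the result is the plain mean.
smooth = gaussian_kernel_retrieve([1.0, 0.0], keys, values, beta=0.0)

# With equal-norm keys, both formulations yield the same weights.
query = [0.5, 0.5]
kernel_out = gaussian_kernel_retrieve(query, keys, values, beta=1.0)
attention_out = softmax_attention(query, keys, values, beta=1.0)
```

In this sketch, the bandwidth controls how local the retrieval is: a very large `beta` approaches nearest-neighbor lookup, while `beta = 0` returns the average of all stored values.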
Abstract
Memory Mosaics are networks of associative memories working in concert to achieve a prediction task of interest. Like transformers, memory mosaics possess compositional capabilities and in-context learning capabilities. Unlike transformers, memory mosaics achieve these capabilities in a comparatively transparent way ("predictive disentanglement"). We illustrate these capabilities on a toy example and also show that memory mosaics perform as well as or better than transformers on medium-scale language modeling tasks.
