Bridging Associative Memory and Probabilistic Modeling
Rylan Schaeffer, Nika Zahedi, Mikail Khona, Dhruv Pai, Sang Truong, Yilun Du, Mitchell Ostrow, Sarthak Chandra, Andres Carranza, Ila Rani Fiete, Andrey Gromov, Sanmi Koyejo
TL;DR
The paper addresses the gap between associative memory and probabilistic modeling by showing how energy-based formulations correspond to likelihood-based views and by introducing dataset-conditioned energy landscapes, enabling in-context learning of energies. It proposes ICL-EBMs with fixed parameters that adapt to in-context data, along with memory-learning AMs such as CLAM+ELBO and CLAM+CRP+ELBO that connect to Gaussian mixtures and Bayesian nonparametrics, respectively. It further analyzes the memory properties of Gaussian KDEs, proving a finite but exponential storage capacity under well-separated, sphere-constrained data, and provides a theoretical account of pre-normalization before self-attention as hyperspherical clustering with von Mises–Fisher distributions. Overall, the work highlights a two-way exchange of ideas between associative memory and probabilistic modeling, offering theoretical grounding and practical insights for energy-based modeling, memory-augmented architectures, and transformer stability.
Abstract
Associative memory and probabilistic modeling are two fundamental topics in artificial intelligence. The first studies recurrent neural networks designed to denoise, complete and retrieve data, whereas the second studies learning and sampling from probability distributions. Based on the observation that associative memory's energy functions can be seen as probabilistic modeling's negative log likelihoods, we build a bridge between the two that enables a useful flow of ideas in both directions. We showcase four examples: First, we propose new energy-based models that flexibly adapt their energy functions to new in-context datasets, an approach we term in-context learning of energy functions. Second, we propose two new associative memory models: one that dynamically creates new memories as necessitated by the training data using Bayesian nonparametrics, and another that explicitly computes proportional memory assignments using the evidence lower bound. Third, using tools from associative memory, we analytically and numerically characterize the memory capacity of Gaussian kernel density estimators, a widespread tool in probabilistic modeling. Fourth, we study a widespread implementation choice in transformers -- normalization followed by self-attention -- to show it performs clustering on the hypersphere. Altogether, this work urges further exchange of useful ideas between these two continents of artificial intelligence.
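
The central observation, that an associative memory's energy function can be read as a negative log likelihood, can be illustrated with a Gaussian kernel density estimator. The following is a minimal sketch rather than the paper's implementation: it assumes an isotropic Gaussian kernel with bandwidth sigma, treats the stored patterns as the kernel centers, and uses a mean-shift style fixed-point update as the retrieval dynamics. The names kde_energy, retrieve, sigma and n_steps are hypothetical and chosen for illustration only.

```python
import numpy as np


def kde_energy(x, memories, sigma):
    """Energy of x: negative log-likelihood under a Gaussian KDE.

    Additive constants (normalizer, log of the number of memories) are
    dropped, since they do not change the shape of the energy landscape.
    """
    sq_dists = np.sum((memories - x) ** 2, axis=1)
    logits = -sq_dists / (2.0 * sigma ** 2)
    m = logits.max()  # log-sum-exp trick for numerical stability
    return -(m + np.log(np.sum(np.exp(logits - m))))


def retrieve(x, memories, sigma, n_steps=50):
    """Descend the energy from a query x toward a stored pattern.

    Each step replaces x with a softmax-weighted average of the memories,
    which is a gradient step on kde_energy with step size sigma**2.
    """
    x = np.asarray(x, dtype=float)
    for _ in range(n_steps):
        sq_dists = np.sum((memories - x) ** 2, axis=1)
        logits = -sq_dists / (2.0 * sigma ** 2)
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()
        x = weights @ memories
    return x
```

With well-separated patterns on the unit sphere and a small bandwidth, a noisy query is expected to settle near the closest stored pattern, and the energy at the retrieved point should be no higher than at the query. For larger bandwidths the minima can merge, which is the regime where capacity questions become relevant.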
