Bridging Associative Memory and Probabilistic Modeling

Rylan Schaeffer; Nika Zahedi; Mikail Khona; Dhruv Pai; Sang Truong; Yilun Du; Mitchell Ostrow; Sarthak Chandra; Andres Carranza; Ila Rani Fiete; Andrey Gromov; Sanmi Koyejo

TL;DR

The paper bridges associative memory and probabilistic modeling by treating associative memory energy functions as negative log-likelihoods, and by introducing dataset-conditioned energy landscapes that enable in-context learning of energy functions. It proposes in-context-learning energy-based models (ICL-EBMs), whose fixed parameters adapt to each in-context dataset, along with two associative memory (AM) models that learn memories: ClAM+ELBO, which connects to Gaussian mixtures, and ClAM+CRP+ELBO, which connects to Bayesian nonparametrics. It further analyzes the memory properties of Gaussian kernel density estimators (KDEs), proving a finite but exponential storage capacity for well-separated data constrained to a sphere, and gives a theoretical account of normalization before self-attention (pre-normalization) as hyperspherical clustering with von Mises–Fisher distributions. Overall, the work shows that ideas flow in both directions between associative memory and probabilistic modeling, offering theoretical grounding and practical insights for energy-based modeling, memory-augmented architectures, and transformer stability.

Abstract

Associative memory and probabilistic modeling are two fundamental topics in artificial intelligence. The first studies recurrent neural networks designed to denoise, complete and retrieve data, whereas the second studies learning and sampling from probability distributions. Based on the observation that associative memory's energy functions can be seen as probabilistic modeling's negative log likelihoods, we build a bridge between the two that enables useful flow of ideas in both directions. We showcase four examples: First, we propose new energy-based models that flexibly adapt their energy functions to new in-context datasets, an approach we term "in-context learning of energy functions". Second, we propose two new associative memory models: one that dynamically creates new memories as necessitated by the training data using Bayesian nonparametrics, and another that explicitly computes proportional memory assignments using the evidence lower bound. Third, using tools from associative memory, we analytically and numerically characterize the memory capacity of Gaussian kernel density estimators, a widespread tool in probabilistic modeling. Fourth, we study a widespread implementation choice in transformers -- normalization followed by self-attention -- to show it performs clustering on the hypersphere. Altogether, this work urges further exchange of useful ideas between these two continents of artificial intelligence.
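
The fourth example can be checked numerically on a toy case. The following is a minimal sketch, not the paper's construction: it assumes a mixture of von Mises-Fisher components with equal mixing weights and one shared concentration `kappa` (both simplifications chosen here). Under those assumptions each component density is proportional to exp(kappa * mu_k . x), so the posterior over components for a unit-norm query coincides with softmax attention weights over unit-norm keys at inverse temperature `kappa`. All function names are illustrative.

```python
import math

def normalize(v):
    """Project a vector onto the unit hypersphere (the normalization step)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys, kappa):
    """Softmax attention weights for a normalized query and normalized keys."""
    q = normalize(query)
    return softmax([kappa * dot(q, normalize(k)) for k in keys])

def vmf_posterior(x, means, kappa):
    """Posterior over equally weighted vMF components with a shared kappa.

    The vMF normalizing constant depends only on kappa and the dimension,
    so it cancels and only exp(kappa * mu . x) remains (assumption of this sketch).
    """
    x = normalize(x)
    unnorm = [math.exp(kappa * dot(normalize(mu), x)) for mu in means]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical toy data: three keys (cluster directions) and one query.
keys = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 1.0, 1.0]]
query = [0.9, 0.1, 0.2]
attn = attention_weights(query, keys, kappa=4.0)
post = vmf_posterior(query, keys, kappa=4.0)
```

Here `attn` and `post` agree up to floating-point error, which is the sense in which attention over normalized vectors acts as a soft cluster assignment on the hypersphere in this simplified setting.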

Paper Structure (18 sections, 4 theorems, 52 equations, 8 figures, 1 table)

Introduction
In-Context Learning of Energy Functions
Motivation for In-Context Learning of Energy Functions
Learning In-Context Energy Functions
Sampling From In-Context Energy Functions
Experiments for In-Context Learning of Energy Functions
Learning Memories for Associative Memory Models
Connecting Research on Learning Memories
Latent Variable Associative Memory Models
Bayesian Nonparametric Associative Memory Models
Nonparametric Latent Variable Energy Functions
Memory Capacity of Gaussian Kernel Density Estimators
A Theoretical Justification for Pre-Normalization before Self-Attention
Discussion
Implementation Details for In-Context Learning of Energy Functions
...and 3 more sections

Key Result

Theorem C.9

The Gaussian KDE energy function is equivalent to the MCHN energy function.
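
A numerical illustration of this kind of equivalence is sketched below, under assumptions that are not taken from the theorem statement itself: MCHN is read as the modern continuous Hopfield network, its energy is taken in the standard log-sum-exp form, all stored patterns share the same norm M, and the inverse temperature is set to beta = 1 / sigma^2. The Gaussian normalizing constant is dropped because it only shifts the energy by a constant. With these choices, expanding the squared distance gives KDE energy = beta * MCHN energy. The paper's exact conditions and constants may differ.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def logsumexp(values):
    """Numerically stable log(sum(exp(values)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def kde_energy(x, patterns, sigma):
    """Negative log of an unnormalized Gaussian KDE centered on the patterns."""
    sq_dists = [sum((a - b) ** 2 for a, b in zip(x, p)) for p in patterns]
    log_mean = logsumexp([-d / (2 * sigma ** 2) for d in sq_dists]) - math.log(len(patterns))
    return -log_mean

def mchn_energy(x, patterns, beta):
    """Log-sum-exp Hopfield energy; m_sq is the largest squared pattern norm."""
    m_sq = max(dot(p, p) for p in patterns)
    lse = logsumexp([beta * dot(p, x) for p in patterns]) / beta
    return -lse + 0.5 * dot(x, x) + math.log(len(patterns)) / beta + 0.5 * m_sq

# Hypothetical toy data: patterns on the unit circle, so every norm equals M = 1.
angles = [0.0, 1.0, 2.5, 4.0]
patterns = [[math.cos(t), math.sin(t)] for t in angles]
queries = [[0.3, -0.2], [1.0, 0.1], [-0.7, 0.9]]
sigma = 0.5
beta = 1.0 / sigma ** 2

# Gap between the KDE energy and the rescaled Hopfield energy at each query.
gaps = [abs(kde_energy(q, patterns, sigma) - beta * mchn_energy(q, patterns, beta))
        for q in queries]
```

Every entry of `gaps` is zero up to floating-point error. If the patterns had different norms, the two energies would no longer match exactly, which is why the equal-norm assumption is stated explicitly.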

Figures (8)

Figure 1: In-Context Learning of Energy Functions. Transformers learn to compute energy functions $E_{\theta}^{ICL}(x|\mathcal{D})$ corresponding to probability distributions $p_{\theta}^{ICL}(x|\mathcal{D})$, where $\mathcal{D}$ are in-context datasets that vary during pretraining. At inference, when conditioned on a new in-context dataset, the transformer computes a new energy function using fixed parameters $\theta$. Left-to-Right: The transformers' energy landscapes sharpen as additional in-context data are conditioned upon.
Figure 2: New Associative Memory Models: Latent Variable and Bayesian Nonparametric. We propose two new associative memory models that can compute proportional cluster assignments using the evidence lower bound (top to bottom) and can create new memories using Bayesian nonparametrics (left to right). Applying both together results in an associative memory model capable of creating new memories and simultaneously explicitly computing cluster assignment posteriors.
Figure 3: ClAM, ClAM+ELBO, and various baselines' performance on supervised metrics for standard benchmark datasets. ClAM+ELBO is competitive with ClAM across benchmark tasks in supervised metrics.
Figure 4: ClAM, ClAM+ELBO, and various baselines' performance on unsupervised metrics for standard benchmark datasets. ClAM+ELBO is competitive with ClAM across benchmark tasks in unsupervised metrics.
Figure 5: Energy landscape of new memory creation. Left: Finite mixture models can result in each cluster's basin stretching out infinitely far. Middle and Right: Using the Chinese Restaurant Process, we endow the associative memory model with the ability to create new memories (cluster centroids) if the data is sufficiently far from existing memories: If a datum flows to the origin, we create a new memory for it. Hyperparameter $\alpha$ controls how likely new memories are to be created, with higher $\alpha$ attracting more points to the origin, causing faster cluster creation.
...and 3 more figures
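
The role of $\alpha$ in the Figure 5 caption can be made concrete with the standard Chinese Restaurant Process prior. The following is only a sketch of that prior, not the paper's energy function or its flow-to-origin rule: given existing clusters with `counts` members each, a new datum joins cluster k with probability counts[k] / (n + alpha) and starts a new cluster with probability alpha / (n + alpha). The function name and the toy counts are illustrative.

```python
def crp_assignment_probs(counts, alpha):
    """Standard CRP prior over assignments for the next datum.

    Returns one probability per existing cluster, followed by the
    probability of creating a new cluster as the last entry.
    """
    n = sum(counts)
    denom = n + alpha
    return [c / denom for c in counts] + [alpha / denom]

# Hypothetical state: two existing memories holding 3 and 1 data points.
counts = [3, 1]
probs_low = crp_assignment_probs(counts, alpha=0.5)
probs_high = crp_assignment_probs(counts, alpha=2.0)
```

The last entry of `probs_high` is larger than the last entry of `probs_low`, matching the caption's statement that higher $\alpha$ makes new memories more likely.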

Theorems & Definitions (15)

Definition C.1: Separation of Patterns
Definition C.2: Pattern Storage
Definition C.3: Retrieval Error
Definition C.4: Storage Capacity
Definition C.5: Largest Norm of Training Data
Definition C.6: MCHN Energy Function
Definition C.7: MCHN Dynamics
Theorem C.9
proof
Theorem C.10
...and 5 more
