Modern Methods in Associative Memory

Dmitry Krotov, Benjamin Hoover, Parikshit Ram, Bao Pham

TL;DR

This paper surveys modern energy-based associative memories (AMs), with a focus on DenseAMs that generalize Hopfield networks to dramatically increase storage capacity. It introduces HAMUX, a modular energy framework that decomposes AMs into neuron-layer and hypersynapse components, enabling deep, hierarchical architectures and energy-based analogs of transformer blocks (Energy Transformer). The work connects AM dynamics to diffusion models, showing that diffusion can be interpreted as AM-like memory recall in the small-data regime and as generative modeling in the large-data regime, providing a unifying view of memory, memorization, and generalization. It also situates AMs within broader machine learning practice, illustrating parametric vs nonparametric formulations, supervised and clustering tasks, and kernel-based interpretations, with practical notebooks for hands-on exploration.

Abstract

Associative Memories, like the famous Hopfield Networks, are elegant models for describing fully recurrent neural networks whose fundamental job is to store and retrieve information. In the past few years they have experienced a surge of interest due to novel theoretical results pertaining to their information storage capabilities and to their relationship with SOTA AI architectures, such as Transformers and Diffusion Models. These connections open up possibilities for interpreting the computation of traditional AI networks through the theoretical lens of Associative Memories. Additionally, novel Lagrangian formulations of these networks make it possible to design powerful distributed models that learn useful representations and inform the design of novel architectures. This tutorial provides an approachable introduction to Associative Memories, emphasizing the modern language and methods used in this area of research, with practical hands-on mathematical derivations and coding notebooks.

Paper Structure

This paper contains 34 sections, 1 theorem, 101 equations, 17 figures.

Key Result

Theorem 5.1

A continuous kernel $\kappa({\mathbf{x}}, {\mathbf{x}}') = \kappa({\mathbf{x}} - {\mathbf{x}}')$ on $\mathbb{R}^D$ is positive definite if and only if $\kappa(\Delta)$ is the Fourier transform of a non-negative measure $p$.
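
As a rough numerical illustration of the theorem (not code from the paper), consider the Gaussian kernel $\kappa(\Delta) = \exp(-\|\Delta\|^2/2)$. Its non-negative measure $p$ is the standard normal distribution, so for this real, symmetric kernel the Fourier transform reduces to $\kappa(\Delta) = \mathbb{E}_{\omega \sim p}[\cos(\omega^\top \Delta)]$. The sketch below estimates that expectation by Monte Carlo sampling and compares it with the closed form. The choice of kernel, the function names, the sample count, and the seed are all assumptions made for illustration.

```python
import math
import random


def gaussian_kernel(delta):
    """Closed-form shift-invariant kernel kappa(delta) = exp(-||delta||^2 / 2)."""
    return math.exp(-0.5 * sum(d * d for d in delta))


def bochner_estimate(delta, n_samples=20000, seed=0):
    """Monte Carlo estimate of E_{omega ~ N(0, I)}[cos(omega . delta)].

    For the Gaussian kernel above, the spectral measure p is the standard
    normal, so this average should approach gaussian_kernel(delta).
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    total = 0.0
    for _ in range(n_samples):
        omega = [rng.gauss(0.0, 1.0) for _ in delta]
        total += math.cos(sum(w * d for w, d in zip(omega, delta)))
    return total / n_samples


DELTA = [0.5, -1.0]  # hypothetical displacement x - x'
exact = gaussian_kernel(DELTA)
approx = bochner_estimate(DELTA)
print(f"closed form: {exact:.4f}, Monte Carlo: {approx:.4f}")
```

The two numbers should agree up to sampling error, which shrinks as `n_samples` grows. The same sampling idea underlies random-feature approximations of shift-invariant kernels.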

Figures (17)

  • Figure 1: The form of Associative Memory discussed in this tutorial uses an energy function to unify three important aspects of human cognition: association, memory, and error correction. We are capable of associating images, sights, sounds, smells, and symbols with each other. These associations allow us to retrieve memories using partial or corrupted information, making it a content-addressable memory with error-correction capabilities. The functionality of Associative Memory is modeled by an energy function, where low values of the energy correspond to stored memories and constitute the most likely states of the system.
  • Figure 2: Gaussian probability distribution function. Shaded area indicates the probability of an error or spin flip.
  • Figure 3: HAMUX hypergraph diagrams are a graphical depiction of an AM whose total energy is the sum of the neuron layer (node) and hypersynapse (hyperedge) energies. Inference is done recurrently, modeled by a system of differential equations where each neuron layer's hidden state updates to minimize the total energy. When all non-linearities are captured in the dynamic neurons, inference becomes a local computation that avoids differentiating through non-linearities.
  • Figure 4: Hypersynapses are represented as undirected (hyper)edges in a hypergraph. Shown is an example pairwise synapse, which is a single energy function $E_{xy}(\hat{{\mathbf{x}}}, \hat{{\mathbf{y}}}; {\boldsymbol{\Xi}})$ defined on the activations $\hat{{\mathbf{x}}}$ and $\hat{{\mathbf{y}}}$ from connected nodes, which necessarily propagates signal to both connected nodes. Here, signal is defined as the negative gradient of the interaction energy w.r.t. the connected layer's activations (e.g., layer $\mathsf{X}$ receives signal $\mathcal{I}_x = -\nabla_{\hat{{\mathbf{x}}}} E_{xy}(\hat{{\mathbf{x}}}, \hat{{\mathbf{y}}}; {\boldsymbol{\Xi}})$ while layer $\mathsf{Y}$ receives signal $\mathcal{I}_y = -\nabla_{\hat{{\mathbf{y}}}} E_{xy}(\hat{{\mathbf{x}}}, \hat{{\mathbf{y}}}; {\boldsymbol{\Xi}})$). This is in contrast to biological synapses which are directional and only propagate signal in one direction from layer $\mathsf{X}$ to $\mathsf{Y}$, needing a separate synapse to bring information back from $\mathsf{Y}$ to $\mathsf{X}$.
  • Figure 5: Energy Transformer (ET) describes an energy-based Associative Memory whose gradient looks like a transformer block. Left: Inside the "ET block", the gradient of ET's energy. The input tokens $\mathbf{x}^{(t)}$ pass through a sequence of operations and are updated to produce the output tokens $\mathbf{x}^{(t+1)}$. The operations inside the ET block are carefully engineered so that the entire network has a global energy function, which decreases with time and is bounded from below. In contrast to conventional transformers, the ET-based analogs of the attention module and the feed-forward MLP module are applied in parallel rather than consecutively. Right: The ET block recurrently minimizes the energy of a corrupted image represented by a collection of tokens, where 50% of the tokens are occluded. Shown is an image of a dog not seen when training ET.
  • ...and 12 more figures
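
The captions for Figures 1 and 3 describe retrieval as energy minimization: stored memories sit at low-energy states, and inference repeatedly updates the state to lower the energy. A minimal sketch of that idea follows, assuming a log-sum-exp energy over Euclidean distances, $E(\mathbf{x}) = -\frac{1}{\beta} \log \sum_\mu \exp\left(-\frac{\beta}{2} \|\mathbf{x} - \boldsymbol{\xi}_\mu\|^2\right)$, and plain discrete gradient descent. This is not the HAMUX API; the energy form, the stored patterns, the step size, and every name below are assumptions chosen for illustration.

```python
import math


def softmax_and_lse(x, memories, beta):
    """Return softmax weights over memories and the log-sum-exp of the logits.

    Logit for memory mu is -beta/2 * ||x - xi_mu||^2.
    """
    logits = [
        -0.5 * beta * sum((xi - mi) ** 2 for xi, mi in zip(x, mem))
        for mem in memories
    ]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    z = sum(exps)
    return [e / z for e in exps], top + math.log(z)


def energy(x, memories, beta):
    """Assumed energy: E(x) = -(1/beta) * log sum_mu exp(-beta/2 ||x - xi_mu||^2)."""
    _, lse = softmax_and_lse(x, memories, beta)
    return -lse / beta


def retrieve(x, memories, beta=4.0, step=0.5, n_steps=50):
    """Gradient descent on the energy; grad E(x) = x - sum_mu p_mu * xi_mu."""
    x = list(x)
    energies = [energy(x, memories, beta)]
    for _ in range(n_steps):
        p, _ = softmax_and_lse(x, memories, beta)
        target = [
            sum(p_mu * mem[i] for p_mu, mem in zip(p, memories))
            for i in range(len(x))
        ]
        x = [xi - step * (xi - ti) for xi, ti in zip(x, target)]
        energies.append(energy(x, memories, beta))
    return x, energies


# Hypothetical stored patterns and a corrupted copy of the first one.
MEMORIES = [[1, 1, -1, -1], [-1, 1, 1, -1], [1, -1, 1, -1]]
QUERY = [0.9, 0.4, -1.2, -0.6]

recalled, energies = retrieve(QUERY, MEMORIES)
print("recalled:", [round(v, 3) for v in recalled])
print("energy start/end:", round(energies[0], 4), round(energies[-1], 4))
```

With these settings the corrupted query should settle near the first stored pattern while the recorded energies never increase, which is the error-correcting behavior the captions describe. A step size of 1 would replace the state with the softmax-weighted average of the memories in a single update.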

Theorems & Definitions (1)

  • Theorem 5.1: Bochner's theorem