Memorization to Generalization: Emergence of Diffusion Models from Associative Memory

Bao Pham, Gabriel Raya, Matteo Negri, Mohammed J. Zaki, Luca Ambrogioni, Dmitry Krotov

Abstract

Dense Associative Memories (DenseAMs) are generalizations of Hopfield networks, which have superior information storage capacity and can store training data points (memories) at local minima of the energy landscape. When the amount of training data exceeds the critical memory storage capacity of these models, new local minima, which are different from the training data, emerge. In Associative Memory these emergent local minima are called $\textit{spurious}\; \textit{states}$, which hinder memory retrieval. In this work, we examine diffusion models (DMs) through the DenseAM lens, viewing their generative process as an attempt at memory retrieval. In the small-data regime, DMs create distinct attractors for each training sample, akin to DenseAMs below the critical memory storage capacity. As the training data size increases, they transition from memorization to generalization. We identify a critical intermediate phase, predicted by DenseAM theory -- the spurious states. In generative modeling, these states are no longer negative artifacts but rather are the first signs of generative capabilities. We characterize the basins of attraction, energy landscape curvature, and computational properties of these previously overlooked states. Their existence is demonstrated across a wide range of architectures and datasets.

Paper Structure

This paper contains 37 sections, 51 equations, 32 figures, 1 table, 4 algorithms.

Figures (32)

  • Figure 1: Panel (A) schematically illustrates the change in the energy landscape as the size of the training dataset is increased. In the small-data regime, the model stores the training data points as local minima of the energy. When the amount of training data exceeds the model's memory capacity, spurious patterns are formed and training data points are no longer energy minima. A subsequent increase of the training data size leads to the generalization phase, defined by the formation of a continuous manifold of low-energy states. Panel (B) provides examples of memorized, spurious, and generalized samples, in their respective columns, for four datasets (MNIST, FASHION-MNIST, CIFAR10, and LSUN-CHURCH); see Sec. \ref{sec:transition} for our definitions of these three sample types. For each target image (shown on the left), its top-4 nearest neighbors from the training set (top row) and the synthetic set (bottom row) are shown to highlight the novelty and commonality of the target image with respect to its training and synthetic sets. To help highlight the novelty of spurious patterns, rectangular markers indicate where the features differ from those of the corresponding nearest neighbors.
  • Figure 2: Energy landscape evolution for the 2D toy model as training data size $K$ increases. Models are trained at $K \in \{2, 9, 1000\}$, using the VE-SDE based diffusion pipeline from \cite{song2021scorebased}, with training data sampled from the unit circle (shown in white). Generated samples are shown alongside the learned score field, i.e., the neural network $s_{\theta}({\mathbf{x}}_t, t)$, which is aligned with the negative gradient of the energy in Eq. (\ref{eq:energy-diffusion}). Hierarchical clustering identifies structure within the generations, with cluster centroids marked by $\textcolor{cyan}{\boldsymbol{\times}}$ and annotated with their energy values. The right-most panel shows the exact solution as $K \rightarrow \infty$ derived in Eq. (\ref{eqn:toy-model-energy}). As $K$ grows, the model initially memorizes individual data points, forming isolated basins. Around $K = 9$, spurious patterns (distinct low-energy attractors not present in the data) emerge and signal the onset of generalization. At large $K$, the model enters a fully generalized regime, where low-energy states lie on a flat continuous manifold.
  • Figure 3: Different sample types across the memorization-to-generalization transition for CIFAR10. The grey histogram shows the distances between synthetic samples and their nearest neighbors from the synthetic set $\mathsf{S}'$. The threshold $\delta_s$ defines a boundary between the two peaks. The olive histogram depicts the distances from the synthetic samples to their closest neighbor from the training set $\mathsf{S}$, with threshold $\delta_m$ separating the two peaks. Memorized samples are located in the left peak of the olive histogram, below $\delta_m$. In contrast, generalized and spurious samples appear to the right of $\delta_m$ in the olive histogram. Examples of the generated samples forming each of the four peaks of the histograms are shown in the inset frames. For each generated sample, the top-4 nearest neighbors from the training set are shown in the top row, and those from the synthetic set are shown in the bottom row. Training set size $K=7310$ is used in this figure, but the discussed phenomena are general and largely independent of this specific value. The fractions of memorized, spurious, and generalized samples in the pool of all generated samples are shown in the bottom left panel as a function of the training set size. The inset shows a magnified view of the spurious fraction (green curve).
  • Figure 4: Fractions of memorized, spurious, and generalized samples in synthetic sets across training sizes and datasets. As the training data size $K$ increases, memorization decreases and the fraction of generalized samples steadily increases; see the top row of Panel (A). The fraction of spurious patterns rises and then falls at the boundary between the memorization and generalization phases; see the bottom row of Panel (A). Panel (B) shows that the average log-volume of the basins of attraction for memorized and spurious samples is statistically larger than that of generalized samples across all datasets. The shaded regions indicate the standard deviation of the log-volume.
  • Figure 5: Panel (A) depicts the average singular values of the energy curvature for different sample types and datasets; the shaded region is the standard deviation of the singular values (a particular training set size $K$ is shown). Memorized and spurious samples generally exhibit higher energy curvature, characterized by fewer near-zero singular values than generalized samples. The bottom row of (A) illustrates the average spectra computed for the training data points, where the changing color scheme denotes small to large data sizes, suggesting a drop in curvature (indicated by the decrease in singular values) as $K$ increases. Panel (B) shows candidate examples of memorized, generalized, and spurious samples from a Stable Diffusion model \cite{rombach2022high} trained on the LAION dataset \cite{schuhmann2022laion}, each corresponding to a distinct curvature signature. The candidate memorized sample has much larger singular values, while the candidate generalized sample has much smaller ones. The candidate spurious samples have larger singular values than the candidate generalized sample, but smaller than the candidate memorized sample. The common trait between candidate memorized and spurious samples is that both are stable attractors, as demonstrated by their repeated similar generations given different initial points (or noise vectors) and conditioning on the text prompt. The y-axis is clipped at the value of 1500 to better contrast the spectra of the shown examples.
  • ...and 27 more figures
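
The two-threshold scheme described in the Figure 3 caption can be illustrated with a minimal sketch. The caption states only that memorized samples fall below $\delta_m$ and that generalized and spurious samples lie above it; the sketch additionally assumes that spurious samples are those with a close synthetic neighbor (distance at most $\delta_s$) and generalized samples are those without one. The function names, the Euclidean metric, and the toy data are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np

def nearest_distances(queries, references, exclude_self=False):
    """Euclidean distance from each query point to its nearest reference point."""
    # Pairwise squared distances via broadcasting: shape (n_queries, n_references).
    diffs = queries[:, None, :] - references[None, :, :]
    d2 = np.sum(diffs ** 2, axis=-1)
    if exclude_self:
        # Queries and references are the same set: ignore the zero self-distance.
        np.fill_diagonal(d2, np.inf)
    return np.sqrt(d2.min(axis=1))

def classify_samples(synthetic, training, delta_m, delta_s):
    """Label each synthetic sample as memorized, spurious, or generalized.

    Assumed rule (hypothetical reading of the caption):
      memorized:   nearest training neighbor within delta_m
      spurious:    far from training data, but within delta_s of another synthetic sample
      generalized: far from both
    """
    d_train = nearest_distances(synthetic, training)
    d_synth = nearest_distances(synthetic, synthetic, exclude_self=True)
    labels = np.full(len(synthetic), "generalized", dtype=object)
    labels[(d_train > delta_m) & (d_synth <= delta_s)] = "spurious"
    labels[d_train <= delta_m] = "memorized"
    return labels

# Toy 2D data (made up for illustration only).
training = np.array([[0.0, 0.0], [10.0, 0.0]])
synthetic = np.array([[0.01, 0.0], [5.0, 5.0], [5.02, 5.0], [20.0, 20.0]])
labels = classify_samples(synthetic, training, delta_m=0.5, delta_s=0.5)
print(list(labels))
```

On this toy input the first point sits next to a training point, the two middle points sit next to each other but far from the training data, and the last point is isolated. In practice the thresholds would be read off the valleys between the histogram peaks rather than fixed by hand, and distances for images would typically be computed on flattened pixels or feature embeddings.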