Stochastic Attention via Langevin Dynamics on the Modern Hopfield Energy

Abdulrahman Alswaidan, Jeffrey D. Varner

TL;DR

A closed-form entropy inflection condition is derived that identifies the retrieval-to-generation transition temperature for any memory geometry, with a scaling law $\beta^*\!\sim\!\sqrt{d}$ for random patterns.

Abstract

Attention heads retrieve: given a query, they return a softmax-weighted average of stored values. We show that this computation is one step of gradient descent on a classical energy function, and that Langevin sampling from the corresponding distribution yields stochastic attention: a training-free sampler controlled by a single temperature. Lowering the temperature gives exact retrieval; raising it gives open-ended generation. Because the energy gradient equals the attention map, no score network, training loop, or learned model is required. We derive a closed-form entropy inflection condition that identifies the retrieval-to-generation transition temperature for any memory geometry, with a scaling law $\beta^*\!\sim\!\sqrt{d}$ for random patterns. We validate on five domains (64 to 4,096 dimensions). On MNIST digit images, stochastic attention is $2.6{\times}$ more novel and $2.0{\times}$ more diverse than the best learned baseline (a VAE trained on the same patterns), while matching a Metropolis-corrected gold standard. On protein sequences from the Pfam RRM family, the generation regime achieves $6.9{\times}$ lower amino acid composition divergence than the VAE (KL $= 0.060$ vs. $0.416$) at matched novelty, demonstrating that the training-free score function preserves family-level fidelity that learned models lose. A denoising diffusion baseline (DDPM) fails across all memory sizes tested ($K = 100$ to $3{,}500$), producing samples indistinguishable from isotropic noise. The approach requires no architectural changes to the underlying attention mechanism.
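
The link between attention and Langevin sampling described in the abstract can be illustrated with a minimal sketch. It assumes the standard modern Hopfield energy $E(\boldsymbol{\xi}) = -\beta^{-1}\log\sum_k \exp(\beta\,\mathbf{x}_k^{\top}\boldsymbol{\xi}) + \tfrac{1}{2}\|\boldsymbol{\xi}\|^2$, keys equal to values, and an unadjusted Langevin update with noise scale $\sqrt{2\alpha/\beta}$. The function names (`attention`, `energy_grad`, `langevin_step`, `sample`) and default arguments are illustrative and are not taken from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(memory, query, beta):
    """Softmax-weighted average of stored patterns (assumes keys == values)."""
    scores = [beta * sum(x_j * q_j for x_j, q_j in zip(x, query)) for x in memory]
    weights = softmax(scores)
    out = [sum(w * x[j] for w, x in zip(weights, memory)) for j in range(len(query))]
    return out, weights


def energy_grad(memory, query, beta):
    """Gradient of the assumed energy: query minus the attention output."""
    out, _ = attention(memory, query, beta)
    return [q - o for q, o in zip(query, out)]


def langevin_step(memory, state, beta, alpha, rng):
    """One unadjusted Langevin step targeting exp(-beta * E) (assumed form)."""
    grad = energy_grad(memory, state, beta)
    noise_scale = math.sqrt(2.0 * alpha / beta)
    return [s - alpha * g + noise_scale * rng.gauss(0.0, 1.0)
            for s, g in zip(state, grad)]


def sample(memory, init, beta, alpha=0.01, steps=500, seed=0):
    """Run a short chain from `init`; large beta should stay near a stored pattern."""
    rng = random.Random(seed)
    state = list(init)
    for _ in range(steps):
        state = langevin_step(memory, state, beta, alpha, rng)
    return state
```

With step size 1 and no noise, `state - energy_grad(...)` reduces to the attention output, which is the "one step of gradient descent" reading of attention. In this sketch, a large `beta` shrinks the noise term and keeps the chain near a stored pattern, while a small `beta` lets it wander between patterns.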

Paper Structure (101 sections, 3 theorems, 43 equations, 18 figures, 15 tables, 3 algorithms)


Key Result

Proposition 1

Let $\sigma_{\max}=\|\mathbf{X}\|_{\mathrm{op}}$ denote the largest singular value of the memory matrix. The modern Hopfield energy has Lipschitz-continuous gradient with constant $L = 1 + \beta\sigma_{\max}^{2}/2$ and satisfies the dissipativity condition $\langle \nabla E$ ...
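
The Lipschitz constant in Proposition 1 can be recovered with a short calculation. The sketch below assumes the standard form of the modern Hopfield energy, $E(\boldsymbol{\xi}) = -\beta^{-1}\log\sum_k \exp(\beta\,\mathbf{x}_k^{\top}\boldsymbol{\xi}) + \tfrac{1}{2}\|\boldsymbol{\xi}\|^2$ up to additive constants, with the stored patterns $\mathbf{x}_k$ as the columns of $\mathbf{X}$ and $\mathbf{a} = \mathrm{softmax}(\beta\mathbf{X}^{\top}\boldsymbol{\xi})$. The symbol $\boldsymbol{\xi}$ for the state is an assumption, since the energy equation is not reproduced in this excerpt.

```latex
\begin{align*}
\nabla E(\boldsymbol{\xi})
  &= \boldsymbol{\xi} - \mathbf{X}\mathbf{a}, \\
\nabla^{2} E(\boldsymbol{\xi})
  &= \mathbf{I} - \beta\,\mathbf{X}\bigl(\operatorname{diag}(\mathbf{a})
     - \mathbf{a}\mathbf{a}^{\top}\bigr)\mathbf{X}^{\top}, \\
\mathbf{v}^{\top}\bigl(\operatorname{diag}(\mathbf{a})
     - \mathbf{a}\mathbf{a}^{\top}\bigr)\mathbf{v}
  &= \operatorname{Var}_{k\sim\mathbf{a}}(v_k)
   \le \tfrac{1}{4}\bigl(\max_k v_k - \min_k v_k\bigr)^{2}
   \le \tfrac{1}{2}\|\mathbf{v}\|^{2}, \\
\bigl\|\nabla^{2} E(\boldsymbol{\xi})\bigr\|_{\mathrm{op}}
  &\le 1 + \beta\,\|\mathbf{X}\|_{\mathrm{op}}^{2}\cdot\tfrac{1}{2}
   = 1 + \frac{\beta\,\sigma_{\max}^{2}}{2} = L .
\end{align*}
```

The third line is the variance bound for a bounded random variable applied to the softmax covariance matrix; it is what produces the factor $1/2$ in $L$.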

Figures (18)

  • Figure 1: Synthetic experiments. (a) Phase behavior as a function of inverse temperature $\beta$ ($d=64$, $K=16$). Left axis (blue): mean cosine similarity to the nearest stored pattern; right axis (coral): scaled entropy $H(\mathbf{a})/\log K$. Both diagnostics reveal a smooth transition centered near $\beta\approx 5\text{--}10$ (gold band). (b) Convergence validation ($d=8$, $K=4$, $\beta=5$). Pooled energy density from eight independent chains (gray) overlaid on a long-run reference distribution (coral); inset reports the Kolmogorov--Smirnov statistic and moment differences.
  • Figure 2: Generated MNIST digit "3" samples ($4\times 4$ grids) at $\beta{=}2000$ ($\mathrm{SNR}{=}0.113$, structured-retrieval regime). Bootstrap outputs are exact copies of stored images. Gaussian perturbation adds unstructured noise. Random convex combinations produce blurry averages. MALA and our ULA-based stochastic attention sampler ($\beta$ controls the operating mode: $\beta{=}2000$ for structured retrieval, $\beta{=}200$ for generation) produce visually indistinguishable diverse, structured digits, confirming that the Metropolis correction is unnecessary at step size $\alpha{=}0.01$.
  • Figure 3: Phase diagram of attention concentration $C = 1 - H(\mathbf{a})/\log K$ over load ratio $K/d$ (horizontal) and inverse temperature $\beta$ (vertical, log scale), with $d=64$. Each cell averages over five independent datasets. The dashed contour marks $C=0.5$, separating a retrieval regime (upper-left, warm colors) from a diffuse regime (lower-right, dark colors).
  • Figure 4: Generated MNIST digit "1" samples ($4{\times}4$ grids). The pattern matches digit "3": only the Langevin-based methods (d, e) produce diverse, structured outputs.
  • Figure 5: Generated MNIST digit "8" samples ($4{\times}4$ grids). Despite digit "8" having higher intra-class variance and more complex topology than digit "3", the qualitative pattern is unchanged: Langevin methods dominate all baselines.
  • ...and 13 more figures
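
The diagnostics named in the captions of Figures 1 and 3 (scaled entropy $H(\mathbf{a})/\log K$, attention concentration $C = 1 - H(\mathbf{a})/\log K$, and cosine similarity to the nearest stored pattern) can be computed as in the following sketch. The function names are illustrative, natural logarithms are assumed, and $K \ge 2$ is required so that $\log K$ is nonzero.

```python
import math


def scaled_entropy(weights):
    """Shannon entropy of attention weights divided by log K (K >= 2 assumed)."""
    h = -sum(w * math.log(w) for w in weights if w > 0.0)
    return h / math.log(len(weights))


def concentration(weights):
    """Attention concentration C = 1 - H(a) / log K; 1 means one-hot, 0 means uniform."""
    return 1.0 - scaled_entropy(weights)


def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_pattern_similarity(sample, memory):
    """Largest cosine similarity between a sample and any stored pattern."""
    return max(cosine_similarity(sample, x) for x in memory)
```

Uniform weights give a scaled entropy of 1 and a concentration of 0, while one-hot weights give a concentration of 1, which matches the retrieval versus diffuse reading of the $C=0.5$ contour in Figure 3.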

Theorems & Definitions (4)

  • Proposition 1
  • Corollary 2
  • Proposition 3
  • proof