Stochastic Attention via Langevin Dynamics on the Modern Hopfield Energy

Abdulrahman Alswaidan; Jeffrey D. Varner

TL;DR

A closed-form entropy inflection condition is derived that identifies the retrieval-to-generation transition temperature for any memory geometry, with a scaling law $\beta^*\!\sim\!\sqrt{d}$ for random patterns.

Abstract

Attention heads retrieve: given a query, they return a softmax-weighted average of stored values. We show that this computation is one step of gradient descent on a classical energy function, and that Langevin sampling from the corresponding distribution yields stochastic attention: a training-free sampler controlled by a single temperature. Lowering the temperature gives exact retrieval; raising it gives open-ended generation. Because the energy gradient equals the attention map, no score network, training loop, or learned model is required. We derive a closed-form entropy inflection condition that identifies the retrieval-to-generation transition temperature for any memory geometry, with a scaling law $\beta^*\!\sim\!\sqrt{d}$ for random patterns. We validate on five domains (64 to 4,096 dimensions). On MNIST digit images, stochastic attention is $2.6{\times}$ more novel and $2.0{\times}$ more diverse than the best learned baseline (a VAE trained on the same patterns), while matching a Metropolis-corrected gold standard. On protein sequences from the Pfam RRM family, the generation regime achieves $6.9{\times}$ lower amino acid composition divergence than the VAE (KL $= 0.060$ vs. $0.416$) at matched novelty, demonstrating that the training-free score function preserves family-level fidelity that learned models lose. A denoising diffusion baseline (DDPM) fails across all memory sizes tested ($K = 100$ to $3{,}500$), producing samples indistinguishable from isotropic noise. The approach requires no architectural changes to the underlying attention mechanism.
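To make the abstract's central claim concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the standard modern Hopfield energy $E(\boldsymbol{\xi}) = -\beta^{-1}\log\sum_k \exp(\beta\,\mathbf{x}_k^\top\boldsymbol{\xi}) + \tfrac{1}{2}\|\boldsymbol{\xi}\|^2$, whose gradient is $\boldsymbol{\xi} - \mathbf{X}\,\mathrm{softmax}(\beta\mathbf{X}^\top\boldsymbol{\xi})$, and an unadjusted Langevin update targeting $\exp(-\beta E)$ with noise scale $\sqrt{2\alpha/\beta}$. The function names, the column-wise layout of `X`, and the exact noise scaling are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def attention_retrieve(X, xi, beta):
    """Deterministic attention: softmax-weighted average of stored patterns.

    X has shape (d, K) with one stored pattern per column (assumed layout).
    """
    return X @ softmax(beta * (X.T @ xi))

def energy_grad(X, xi, beta):
    """Gradient of the assumed modern Hopfield energy: xi minus the attention output."""
    return xi - attention_retrieve(X, xi, beta)

def stochastic_attention(X, xi0, beta, alpha=0.01, n_steps=1000, rng=None):
    """Unadjusted Langevin iteration on the assumed energy (illustrative only).

    Each step is a gradient step of size alpha plus Gaussian noise with
    standard deviation sqrt(2 * alpha / beta), an assumed scaling.
    """
    rng = np.random.default_rng() if rng is None else rng
    xi = np.array(xi0, dtype=float)
    noise_scale = np.sqrt(2.0 * alpha / beta)
    for _ in range(n_steps):
        noise = rng.standard_normal(xi.shape)
        xi = xi - alpha * energy_grad(X, xi, beta) + noise_scale * noise
    return xi
```

With `alpha = 1` and the noise removed, one update reduces to `attention_retrieve(X, xi, beta)`, which is the sense in which attention is a single gradient step. In this sketch, a large `beta` keeps samples close to a stored pattern, while a smaller `beta` lets the chain spread across the memory.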

Paper Structure (101 sections, 3 theorems, 43 equations, 18 figures, 15 tables, 3 algorithms)


Introduction
Related Work
Hopfield networks and associative memory.
Energy-based, score-based and diffusion models.
Background
Attention as deterministic retrieval.
The energy landscape beneath attention.
The bridge.
From minimization to sampling.
Method
Stochastic Attention Update
Properties and Limiting Behavior
Beyond the convex regime.
Phase Transition and Temperature Selection
Scaling for random memories.
...and 86 more sections

Key Result

Proposition 1

Let $\sigma_{\max}=\|\mathbf{X}\|_{\mathrm{op}}$ denote the largest singular value of the memory matrix. The modern Hopfield energy has Lipschitz-continuous gradient with constant $L = 1 + \beta\sigma_{\max}^{2}/2$ and satisfies the dissipativity condition $\langle \nabla E
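The Lipschitz constant in Proposition 1 can be checked numerically. The sketch below assumes the standard modern Hopfield energy, with gradient $\boldsymbol{\xi} - \mathbf{X}\,\mathrm{softmax}(\beta\mathbf{X}^\top\boldsymbol{\xi})$ and patterns stored as columns of `X`. The helper names are hypothetical. It compares the largest observed ratio $\|\nabla E(\mathbf{a}) - \nabla E(\mathbf{b})\| / \|\mathbf{a}-\mathbf{b}\|$ over random pairs against $L = 1 + \beta\sigma_{\max}^2/2$.

```python
import numpy as np

def grad_energy(X, xi, beta):
    """Gradient of the assumed modern Hopfield energy at xi; X is (d, K)."""
    z = beta * (X.T @ xi)
    p = np.exp(z - z.max())
    p /= p.sum()
    return xi - X @ p

def lipschitz_constant(X, beta):
    """L = 1 + beta * sigma_max^2 / 2, with sigma_max the spectral norm of X."""
    sigma_max = np.linalg.norm(X, 2)  # largest singular value of a 2-D array
    return 1.0 + beta * sigma_max**2 / 2.0

def max_empirical_ratio(X, beta, n_pairs=200, seed=0):
    """Largest observed gradient-difference ratio over random point pairs."""
    rng = np.random.default_rng(seed)
    d = X.shape[0]
    worst = 0.0
    for _ in range(n_pairs):
        a = rng.standard_normal(d)
        b = rng.standard_normal(d)
        num = np.linalg.norm(grad_energy(X, a, beta) - grad_energy(X, b, beta))
        worst = max(worst, num / np.linalg.norm(a - b))
    return worst
```

Under these assumptions the empirical ratio should never exceed the bound, since the softmax covariance has operator norm at most $1/2$. A random search of this kind only probes the bound and does not prove it.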

Figures (18)

Figure 1: Synthetic experiments. (a) Phase behavior as a function of inverse temperature $\beta$ ($d=64$, $K=16$). Left axis (blue): mean cosine similarity to the nearest stored pattern; right axis (coral): scaled entropy $H(\mathbf{a})/\log K$. Both diagnostics reveal a smooth transition centered near $\beta\approx 5\text{--}10$ (gold band). (b) Convergence validation ($d=8$, $K=4$, $\beta=5$). Pooled energy density from eight independent chains (gray) overlaid on a long-run reference distribution (coral); inset reports the Kolmogorov--Smirnov statistic and moment differences.
Figure 2: Generated MNIST digit "3" samples ($4\times 4$ grids) at $\beta{=}2000$ ($\mathrm{SNR}{=}0.113$, structured-retrieval regime). Bootstrap outputs are exact copies of stored images. Gaussian perturbation adds unstructured noise. Random convex combinations produce blurry averages. MALA and our ULA-based stochastic attention sampler ($\beta$ controls the operating mode: $\beta{=}2000$ for structured retrieval, $\beta{=}200$ for generation) produce visually indistinguishable diverse, structured digits, confirming that the Metropolis correction is unnecessary at step size $\alpha{=}0.01$.
Figure 3: Phase diagram of attention concentration $C = 1 - H(\mathbf{a})/\log K$ over load ratio $K/d$ (horizontal) and inverse temperature $\beta$ (vertical, log scale), with $d=64$. Each cell averages over five independent datasets. The dashed contour marks $C=0.5$, separating a retrieval regime (upper-left, warm colors) from a diffuse regime (lower-right, dark colors).
Figure 4: Generated MNIST digit "1" samples ($4{\times}4$ grids). The pattern matches digit "3": only the Langevin-based methods (d, e) produce diverse, structured outputs.
Figure 5: Generated MNIST digit "8" samples ($4{\times}4$ grids). Despite digit "8" having higher intra-class variance and more complex topology than digit "3", the qualitative pattern is unchanged: Langevin methods dominate all baselines.
...and 13 more figures
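The diagnostics named in the captions of Figures 1 and 3 are the scaled entropy $H(\mathbf{a})/\log K$ and the attention concentration $C = 1 - H(\mathbf{a})/\log K$. The sketch below computes both from a vector of attention weights. The function names and the column-wise layout of `X` are assumptions made for illustration, and the code does not attempt to reproduce the figures.

```python
import numpy as np

def attention_weights(X, xi, beta):
    """Softmax attention weights a over K stored patterns (columns of X)."""
    z = beta * (X.T @ xi)
    a = np.exp(z - z.max())
    return a / a.sum()

def scaled_entropy(a):
    """Shannon entropy of a divided by log K, so the value lies in [0, 1]."""
    a = np.asarray(a, dtype=float)
    nz = a[a > 0]  # treat 0 * log 0 as 0
    return float(-(nz * np.log(nz)).sum() / np.log(a.size))

def concentration(a):
    """Attention concentration C = 1 - H(a) / log K."""
    return 1.0 - scaled_entropy(a)
```

Uniform weights give $C = 0$ and one-hot weights give $C = 1$. Sweeping `beta` for a fixed query therefore traces the kind of smooth transition described in Figure 1(a).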

Theorems & Definitions (4)

Proposition 1
Corollary 2
Proposition 3
proof
