Dynamical Mean-Field Theory of Self-Attention Neural Networks

Ángel Poc-López, Miguel Aguilera

TL;DR

The paper studies transformer dynamics by mapping self-attention onto an asymmetric Hopfield network and formulating a nonequilibrium dynamical mean-field theory based on a generating functional $Z(g)$, which becomes exact in the large-$N$ limit. With 1-bit token encodings, it derives closed-form mean-field updates for the overlaps $m^\alpha_{a,t}$ and the normalized attentions $\hat{A}^a_t$, and discusses a simplified softmax output. The resulting dynamics show nonequilibrium phase transitions leading to periodic, quasi-periodic, and chaotic attractors, with memory effects extending beyond the short context window. The authors argue that the approach could aid interpretability and reduce training costs, and that it can be extended to more realistic transformer settings and to finite-size corrections.

Abstract

Transformer-based models have demonstrated exceptional performance across diverse domains, becoming the state-of-the-art solution for addressing sequential machine learning problems. Even though we have a general understanding of the fundamental components of the transformer architecture, little is known about how they operate or what their expected dynamics are. Recently, there has been increasing interest in exploring the relationship between attention mechanisms and Hopfield networks, promising to shed light on the statistical physics of transformer networks. However, to date, the dynamical regimes of transformer-like models have not been studied in depth. In this paper, we address this gap by using methods for the study of asymmetric Hopfield networks in nonequilibrium regimes, namely path integral methods over generating functionals, which yield dynamics governed by concurrent mean-field variables. Assuming 1-bit tokens and weights, we derive analytical approximations for the behavior of large self-attention neural networks coupled to a softmax output, which become exact in the large-size limit. Our findings reveal nontrivial dynamical phenomena, including nonequilibrium phase transitions associated with chaotic bifurcations, even for very simple configurations with a few encoded features and a very short context window. Finally, we discuss the potential of our analytic approach to improve our understanding of the inner workings of transformer models, potentially reducing computational training costs and enhancing model interpretability.
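As a rough illustration of the attention/Hopfield relationship mentioned in the abstract, the sketch below treats softmax attention over stored $\pm 1$ tokens as a pattern-retrieval step driven by normalized overlaps. It is not the paper's model: the function names (`softmax`, `attend`), the overlap scaling, the choice of values equal to keys, and all parameter values are assumptions made for this example only.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def attend(query, keys, values, beta):
    """Softmax attention read as Hopfield-style retrieval (illustrative only).

    overlaps[i] is the normalized overlap between the query and key i,
    and beta plays the role of an inverse temperature.
    """
    n = query.shape[0]
    overlaps = keys @ query / n          # each entry lies in [-1, 1]
    weights = softmax(beta * overlaps)   # attention weights, sum to 1
    return weights @ values, weights, overlaps

rng = np.random.default_rng(0)
n_units, n_tokens = 200, 5               # hypothetical sizes

# Random +/-1 tokens; values are set equal to keys (an assumption here).
keys = rng.choice([-1, 1], size=(n_tokens, n_units))
values = keys.copy()

# Query: a corrupted copy of token 2 with 20 of its 200 units flipped.
query = keys[2].copy()
flip = rng.choice(n_units, size=20, replace=False)
query[flip] *= -1

output, weights, overlaps = attend(query, keys, values, beta=20.0)
recovered = np.sign(output).astype(int)  # binarize the attention output
```

With a large `beta`, the attention weights concentrate on the key with the highest overlap, so the binarized output matches the stored token; lowering `beta` spreads the weights across tokens and blurs the retrieval.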

Paper Structure

This paper contains 11 sections, 19 equations, 5 figures.

Figures (5)

  • Figure 1: a) Bipartite Hopfield network, equivalent to the attention mechanism when summing over $\tau \in \{1,2,\dots,t\}$. b) Description of a softmax output with a linear encoding over attention values.
  • Figure 2: Description of an attention layer connected to a softmax output. Each link between queries $q_t$ and keys $k_t$, as well as between attention values $v_t$ and outputs $o_t$, represents a bipartite Hopfield network as in Fig. 1. Attention values $v_t$ take the same value as $k_t$.
  • Figure 3: Bifurcation diagram for $\beta\in[0,3]$ (a) and $\beta \in[1.24,1.28]$ (b). For each $\beta$, we plot all the points traversed by the trajectory (black points for periodic trajectories, yellow points for the rest). For quasi-periodic and chaotic trajectories, points intersecting the plane $m^o_{2}=0$ are shown in orange and purple, respectively. In both bifurcation diagrams, we first simulated the highest $\beta$ value and then used the context at the end of that run as the initial condition for the other $\beta$ values. Only the semantic information of the mean-field variables is shown (first term of Eq. \ref{eq:mf_o_pe}).
  • Figure 4: Trajectories of mean-field variables $m^o_1(t)$ vs. $m^o_2(t)$. Only the semantic part (first term of Eq. \ref{eq:mf_o_pe}) is represented for different $\beta$. Points at $\beta=1.27$ are larger to facilitate visualization.
  • Figure 5: Examples of quasi-periodic (left) and chaotic (right) trajectories (top). Mean-field trajectories (first term of Eq. \ref{eq:mf_o_pe}) are plotted after 112,800 steps. We also show the discrete Fast Fourier Transform (middle) and autocorrelation function (bottom) of 20,000 samples of $m^o_{1,t}$ at steady state. Note that the middle-right panel has a peak of 15 at $f=0.5$, which is not shown so that the other frequencies are easier to see.
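
The Fourier and autocorrelation diagnostics described for Figure 5 can be sketched on stand-in signals. The example below does not simulate the paper's mean-field equations: it assumes a two-frequency cosine with incommensurate frequencies as the quasi-periodic case and a logistic map at $r=3.9$ as the chaotic case, and the helper names (`power_spectrum`, `autocorrelation`, `top_bins_fraction`) are hypothetical.

```python
import numpy as np

def power_spectrum(x):
    """One-sided power spectrum of a mean-removed series (unit time step)."""
    x = x - x.mean()
    spec = np.abs(np.fft.rfft(x)) ** 2 / len(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0)
    return freqs, spec

def autocorrelation(x, max_lag):
    """Biased autocorrelation estimate for lags 0..max_lag, with acf[0] = 1."""
    x = x - x.mean()
    var = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / var
                     for k in range(max_lag + 1)])

def top_bins_fraction(spec, k=10):
    """Fraction of total power carried by the k strongest frequency bins."""
    return np.sort(spec)[-k:].sum() / spec.sum()

def logistic_series(r, n, burn_in=1000, x0=0.3):
    """Stand-in chaotic signal: iterates of x -> r * x * (1 - x)."""
    x = x0
    for _ in range(burn_in):          # discard the transient
        x = r * x * (1 - x)
    out = np.empty(n)
    for i in range(n):
        x = r * x * (1 - x)
        out[i] = x
    return out

n = 4096
t = np.arange(n)

# Stand-in quasi-periodic signal: two incommensurate frequencies.
f1 = 0.05
f2 = f1 * np.sqrt(2)
quasi = np.cos(2 * np.pi * f1 * t) + 0.5 * np.cos(2 * np.pi * f2 * t)
chaotic = logistic_series(r=3.9, n=n)

_, spec_quasi = power_spectrum(quasi)
_, spec_chaotic = power_spectrum(chaotic)
acf_quasi = autocorrelation(quasi, max_lag=400)
acf_chaotic = autocorrelation(chaotic, max_lag=400)

# Spectral concentration and long-lag correlation for each signal.
conc_quasi = top_bins_fraction(spec_quasi)
conc_chaotic = top_bins_fraction(spec_chaotic)
tail_quasi = np.max(np.abs(acf_quasi[100:]))
tail_chaotic = np.max(np.abs(acf_chaotic[100:]))
```

In this toy setting, the quasi-periodic signal concentrates its power in a few sharp spectral lines and keeps a large autocorrelation at long lags, whereas the chaotic signal has a broadband spectrum and an autocorrelation that decays toward zero.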