Self-attention as an attractor network: transient memories without backpropagation

Francesco D'Amico, Matteo Negri

TL;DR

This work shows that it is possible to obtain the self-attention layer as the derivative of local energy terms, which resemble a pseudo-likelihood, and presents a novel framework to interpret self-attention as an attractor network.

Abstract

Transformers are among the most successful architectures in modern neural networks. At their core is the attention mechanism, which has recently drawn interest from the physics community because, in certain cases, it can be written as the derivative of an energy function: while the cross-attention layer can be written as a modern Hopfield network, the same is not possible for self-attention, which is used in GPT architectures and other autoregressive models. In this work we show that the self-attention layer can be obtained as the derivative of local energy terms, which resemble a pseudo-likelihood. We leverage the analogy with pseudo-likelihood to design a recurrent model that can be trained without backpropagation: its dynamics show transient states that are strongly correlated with both train and test examples. Overall, we present a novel framework to interpret self-attention as an attractor network, potentially paving the way for new physics-inspired theoretical approaches to understanding transformers.

Paper Structure (11 sections, 8 equations, 2 figures)

Figures (2)

  • Figure 1: Test examples appear as transient states of the dynamics. We compare the test-set performance of various models on two tasks: the upper panels show a masked prediction task, the lower panels a denoising task. Note that the transformers are trained separately for the two tasks, while the self-attention is trained in a task-agnostic way. The plots in the left panels show the mean square error of the prediction after $t$ iterations of the model: the blue lines correspond to a transformer block, the orange lines to the bare self-attention layer, and the green dashed line to a standard 5-layer transformer (this line is horizontal because we do not iterate this model). The columns in the right panels show examples of the performance of the transformer blocks and the bare self-attention. In the masked prediction task we masked $30\%$ of the pixels. In the denoising task we added Gaussian noise with zero mean and variance $0.7$ to each pixel. Note that in the denoising task $t=1$ corresponds to the MSE of the corrupted images and $t=2$ is the error after the first iteration. For the masked prediction task, the error of the corrupted images is not well defined; therefore, $t=1$ is the error after the first iteration.
  • Figure 2: Bare Self-Attention predicts more uniform patches. We plot the distribution of variances of predicted pixels for different models, on the masked prediction task. As we make the model simpler, the variance within a patch decreases, meaning that the prediction becomes more and more uniform. For this plot, the Bare Self-Attention was trained on $4\times4$ patches like the other two models.
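
The evaluation protocol described in the two captions can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the authors' code: images are flat lists of floats, masked pixels are replaced by a placeholder value `fill` (the captions do not say how masked pixels are represented), and `step` is a hypothetical stand-in for one iteration of whichever model is being evaluated. All function names are invented for this example, and plain Python lists are used only to keep it dependency-free.

```python
import random


def mask_pixels(image, fraction=0.3, rng=None, fill=0.0):
    """Hide `fraction` of the pixels; return (corrupted image, boolean mask).

    `fill` is an assumed placeholder for hidden pixels.
    """
    if rng is None:
        rng = random.Random(0)
    n = len(image)
    hidden = set(rng.sample(range(n), round(fraction * n)))
    corrupted = [fill if i in hidden else x for i, x in enumerate(image)]
    mask = [i in hidden for i in range(n)]
    return corrupted, mask


def add_gaussian_noise(image, variance=0.7, rng=None):
    """Add zero-mean Gaussian noise with the given variance to each pixel."""
    if rng is None:
        rng = random.Random(0)
    sigma = variance ** 0.5  # random.gauss expects a standard deviation
    return [x + rng.gauss(0.0, sigma) for x in image]


def mse(a, b):
    """Mean square error between two equally sized flat images."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mse_per_iteration(step, start, target, n_steps, include_start):
    """Apply `step` repeatedly and record the MSE against `target`.

    include_start=True mirrors the denoising convention (first entry is the
    error of the corrupted image); include_start=False mirrors the masked
    prediction convention (first entry is the error after one iteration).
    """
    errors = [mse(start, target)] if include_start else []
    state = start
    for _ in range(n_steps):
        state = step(state)
        errors.append(mse(state, target))
    return errors


def patch_variances(image, side, patch=4):
    """Variance of pixel values inside each non-overlapping patch.

    Assumes a square image of side `side`, with `side` divisible by `patch`.
    """
    variances = []
    for r0 in range(0, side, patch):
        for c0 in range(0, side, patch):
            vals = [image[r * side + c]
                    for r in range(r0, r0 + patch)
                    for c in range(c0, c0 + patch)]
            mean = sum(vals) / len(vals)
            variances.append(sum((v - mean) ** 2 for v in vals) / len(vals))
    return variances


if __name__ == "__main__":
    clean = [0.5] * 64  # toy 8x8 image, flattened
    noisy = add_gaussian_noise(clean, variance=0.7)

    # Hypothetical stand-in for a model iteration: move halfway to `clean`.
    def toy_step(state):
        return [0.5 * (s + c) for s, c in zip(state, clean)]

    print(mse_per_iteration(toy_step, noisy, clean, n_steps=3, include_start=True))
    print(patch_variances(noisy, side=8, patch=4))
```

The `include_start` flag exists only to reproduce the different meaning of $t=1$ in the two tasks; a histogram of the values returned by `patch_variances` over many predicted images would correspond to the kind of distribution shown in Figure 2.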