On the Role of Hidden States of Modern Hopfield Network in Transformer

Tsubasa Masumura, Masato Taki

TL;DR

This paper extends the link between modern Hopfield networks and Transformer architectures by preserving hidden-state dynamics across layers, yielding modern Hopfield attention (MHA). MHA augments self-attention with an accumulated hidden state that reuses upper-layer attention scores, improving attention quality and mitigating rank collapse. Across GPT-2, LLaMA, and ViT models, including ImageNet-1k experiments, MHA delivers consistent perplexity and accuracy gains and improves transfer performance on downstream tasks. The work provides both theoretical and empirical support for a Hopfield-inspired way to strengthen Transformer architectures with minimal computational overhead and no additional trainable parameters.

Abstract

Associative memory models based on Hopfield networks and self-attention based on key-value mechanisms have been popular approaches in the study of memory mechanisms in deep learning. It has been pointed out that the state update rule of the modern Hopfield network (MHN) in the adiabatic approximation agrees with the self-attention layer of the Transformer. In this paper, we go beyond this approximation and investigate the relationship between MHN and self-attention. Our results show that the correspondence between Hopfield networks and Transformers can be established in a more general form by adding a new variable, the hidden state derived from the MHN, to self-attention. This new attention mechanism, modern Hopfield attention (MHA), allows attention scores to be inherited from the input layer of the Transformer to the output layer, which greatly improves the properties of the attention weights. In particular, we show both theoretically and empirically that MHA hidden states significantly mitigate two serious problems of deep Transformers, known as rank collapse and token uniformity. We also confirm that MHA can systematically improve accuracy without adding trainable parameters to the Vision Transformer or GPT. Our results provide a new case in which Hopfield networks offer a useful perspective for improving the Transformer architecture.
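
The idea of carrying attention scores from one layer to the next can be illustrated with a minimal NumPy sketch. The update `H_n = alpha * H_{n-1} + Q_n K_n^T / sqrt(d_k)` used here is an assumption made for illustration and is not the exact rule from the paper; the function name `hidden_state_attention`, the single coefficient `alpha`, and the random weights are likewise hypothetical. The sketch is single-head, unmasked, and omits residual connections and MLP blocks.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def hidden_state_attention(X, layers, alpha=0.5):
    """Attention-only stack whose scores accumulate in a hidden state H.

    ASSUMPTION: H_n = alpha * H_{n-1} + Q_n K_n^T / sqrt(d_k) is an
    illustrative update, not the paper's exact formulation.
    """
    n_tokens = X.shape[0]
    H = np.zeros((n_tokens, n_tokens))  # one entry per (query, key) pair
    for Wq, Wk, Wv in layers:
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        d_k = Q.shape[-1]
        # Carry the earlier layers' scores forward and add the current ones.
        H = alpha * H + (Q @ K.T) / np.sqrt(d_k)
        # Attend with the accumulated scores instead of the current ones only.
        X = softmax(H) @ V
    return X, H

rng = np.random.default_rng(0)
d = 8
X0 = rng.normal(size=(5, d))  # 5 tokens of dimension 8
layers = [
    tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    for _ in range(4)
]

out_mha, H_last = hidden_state_attention(X0, layers, alpha=0.5)
out_sa, _ = hidden_state_attention(X0, layers, alpha=0.0)  # plain self-attention
```

With `alpha=0.0` the hidden state holds only the current layer's scores, so the stack reduces to ordinary softmax self-attention. The accumulation introduces no new weight matrices, only a scalar coefficient, which is in the spirit of the claim that no trainable parameters are added.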

Paper Structure

This paper contains 39 sections, 5 theorems, 64 equations, 22 figures, 10 tables.

Key Result

Theorem 5.1

The norm of the residual of the attention-only network ${\textrm{AttnNet}}(\boldsymbol{X})$ decays with depth according to a bound expressed in terms of $r=\frac{8H}{\sqrt{d_k}}$ and a certain constant $C$. This suggests doubly exponential decay of the rank.
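
To make the notion of a residual concrete, the following NumPy sketch measures how far a token matrix is from a rank-1 matrix with identical rows, and tracks that quantity through a stack of attention-only layers. It is a sketch under stated assumptions: the residual is taken with respect to the mean row (the minimiser for the Frobenius norm, which may differ from the norm used in the theorem), and the helper names, random weights, and layer count are hypothetical.

```python
import numpy as np

def residual(X):
    """Deviation of X from the rank-1 matrix whose rows all equal the mean token.

    ASSUMPTION: the Frobenius-optimal shared row (the mean) is used; the
    theorem's norm may be defined differently.
    """
    return X - X.mean(axis=0, keepdims=True)

def relative_residual(X):
    """Frobenius norm of the residual, normalised by the norm of X."""
    return np.linalg.norm(residual(X)) / np.linalg.norm(X)

def attention_only_layer(X, Wq, Wk, Wv):
    """Single-head softmax self-attention with no skip connection or MLP."""
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(Wq.shape[1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ (X @ Wv)

rng = np.random.default_rng(0)
d = 16
X = rng.normal(size=(10, d))  # 10 tokens of dimension 16

history = [relative_residual(X)]
for _ in range(6):
    Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    X = attention_only_layer(X, Wq, Wk, Wv)
    history.append(relative_residual(X))

print(history)  # relative residual at the input and after each layer
```

A residual of zero means every token has become the same vector, that is, the representation has collapsed to rank 1. Normalising by the norm of `X` keeps the measurement comparable across layers even when the value projection rescales the activations.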

Figures (22)

  • Figure 1: (a) The left figure shows the layer structure of a Transformer architecture using modern Hopfield attention (MHA). As the hidden state $\boldsymbol{H}_n$ propagates through each attention layer, information from the upper layers' attention scores is reused in the lower layers. The attention score $\boldsymbol{Q}_n\boldsymbol{K}_n^\top$ is accumulated in the hidden state of each layer, and this accumulated value is used for the attention calculation. (b) A visualization of token uniformity in layers 12 and 24 of GPT-2 (Medium) trained on the Wikitext103 dataset, shown as violin plots of the cosine similarity between tokens. For GPT-2 in the left column, there is a strong peak at similarity 1, and both layers have a mode of 1. In contrast, for GPT-2 with MHA in the right column, the cosine similarity stays low and token uniformity is dramatically reduced.
  • Figure 2: Violin plots of the cosine similarity between tokens in several layers for (a) GPT-2 (Medium) trained on Wikitext103 and (b) ViT-B trained on CIFAR100. Some MHA layers still have high average token similarity, but the token pairs with a perfect similarity of 1 seen with self-attention disappear, which prevents the rank from dropping.
  • Figure 3: The architecture of the MHA examined in detail in this paper. This model corresponds to the case where the forward derivative is used for the visible state and the backward derivative for the hidden state. Simply setting $\alpha=\alpha'=0$ reproduces normal self-attention.
  • Figure 4: The architecture corresponding to the case where the forward derivative is used for both the visible state and the hidden state.
  • Figure 5: A possible interpretation as a recurrent neural network when both the visible and hidden states use the backward derivative.
  • ...and 17 more figures
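
The token-uniformity measurement behind the violin plots in Figures 1(b) and 2 can be sketched as the set of cosine similarities between all pairs of token representations in one layer. The function name `pairwise_cosine_similarity` and the toy inputs are hypothetical, and the plotting step is omitted; how the paper samples tokens or layers is not stated here, so this is only one plausible way to compute the plotted quantity.

```python
import numpy as np

def pairwise_cosine_similarity(tokens, eps=1e-12):
    """Cosine similarity for every unordered pair of distinct token vectors.

    tokens: array of shape (n_tokens, d), one row per token representation.
    Returns a 1-D array with n_tokens * (n_tokens - 1) / 2 entries.
    """
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.maximum(norms, eps)  # guard against zero vectors
    sim = unit @ unit.T
    i, j = np.triu_indices(tokens.shape[0], k=1)  # skip self-pairs on the diagonal
    return sim[i, j]

# Toy inputs: a fully collapsed layer versus a layer with orthogonal tokens.
collapsed = np.tile(np.array([[1.0, 2.0, 3.0]]), (4, 1))
diverse = np.eye(3)

sims_collapsed = pairwise_cosine_similarity(collapsed)
sims_diverse = pairwise_cosine_similarity(diverse)
```

A distribution concentrated at 1, as for `collapsed`, corresponds to the strong peak described for standard GPT-2, while values spread well below 1 correspond to the behaviour reported for MHA.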

Theorems & Definitions (6)

  • Theorem 5.1 (Dong et al., 2021)
  • Theorem 5.2
  • Proof
  • Lemma D.1
  • Lemma D.2
  • Theorem D.3