On the Role of Hidden States of Modern Hopfield Network in Transformer
Tsubasa Masumura, Masato Taki
TL;DR
This paper extends the link between modern Hopfield networks and Transformer architectures by retaining the Hopfield hidden-state dynamics across layers, yielding Modern Hopfield Attention (MHA). MHA augments self-attention with an accumulated hidden state that carries attention scores forward from earlier layers, improving attention quality and mitigating rank collapse. In experiments with GPT-2, LLaMA, and ViT, including ImageNet-1k classification, MHA delivers consistent perplexity and accuracy gains and improves transfer performance on downstream tasks. The work provides both theoretical and empirical support for a Hopfield-inspired route to strengthening Transformer architectures with minimal computational overhead and no additional trainable parameters.
Abstract
Associative memory models based on Hopfield networks and self-attention based on key-value mechanisms have been popular approaches in the study of memory mechanisms in deep learning. It has been pointed out that the state update rule of the modern Hopfield network (MHN) in the adiabatic approximation coincides with the self-attention layer of the Transformer. In this paper, we go beyond this approximation and investigate the relationship between MHN and self-attention. Our results show that the correspondence between Hopfield networks and Transformers can be established in a more general form by adding a new variable, the hidden state derived from the MHN, to self-attention. This new attention mechanism, modern Hopfield attention (MHA), allows attention scores to be inherited from the input layer of the Transformer through to the output layer, which greatly improves the properties of the attention weights. In particular, we show both theoretically and empirically that MHA hidden states significantly mitigate rank collapse and token uniformity, which are serious problems in deep Transformers. We also confirm that MHA can systematically improve accuracy without adding trainable parameters to the Vision Transformer or GPT. Our results provide a new case in which Hopfield networks offer a useful perspective for improving the Transformer architecture.
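To make the idea of inheriting attention scores across layers concrete, the following is a minimal sketch of one possible reading of it, not the paper's exact update rule. It assumes the hidden state is the pre-softmax score matrix, that each layer adds its own scaled dot-product scores to the hidden state received from the previous layer, and that attention weights are the row-wise softmax of the result. The function name `attention_with_hidden_state` and the `decay` coefficient are hypothetical and introduced only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def scaled_scores(q, k):
    """Scaled dot-product scores Q K^T / sqrt(d)."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    return [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]


def attention_with_hidden_state(q, k, v, hidden=None, decay=1.0):
    """One attention layer with an accumulated score state (assumed form).

    hidden: pre-softmax state passed from the previous layer, or None.
    decay:  hypothetical weight on the inherited state; decay=0.0 or
            hidden=None recovers standard softmax attention.
    Returns (layer output, new hidden state for the next layer).
    """
    scores = scaled_scores(q, k)
    if hidden is None:
        new_hidden = scores
    else:
        new_hidden = [
            [decay * h + s for h, s in zip(h_row, s_row)]
            for h_row, s_row in zip(hidden, scores)
        ]
    weights = [softmax(row) for row in new_hidden]
    return matmul(weights, v), new_hidden


# Two stacked layers sharing toy inputs: the second inherits the first's scores.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out1, h1 = attention_with_hidden_state(q, k, v)
out2, h2 = attention_with_hidden_state(q, k, v, hidden=h1)
```

In this sketch no parameters are added: the only extra cost is carrying one score matrix per head between layers and performing an element-wise addition. Whether the paper's hidden state is accumulated before or after the softmax, and with what coefficient, should be checked against the full text.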
