Outlier-Efficient Hopfield Layers for Large Transformer-Based Models

Jerry Yao-Chieh Hu, Pei-Hsuan Chang, Robin Luo, Hong-Yu Chen, Weijian Li, Wei-Po Wang, Han Liu

TL;DR

This work tackles the outlier inefficiency in training large transformer models by introducing Outlier-Efficient Hopfield Layers (OutEffHop). It formulates an outlier-aware modern Hopfield model with an added no-op classification dimension and a refined energy function, whose retrieval dynamics align with an outlier-efficient attention mechanism and, in a single step, approximate Softmax-based attention. Theoretical contributions include convergence guarantees, tighter retrieval-error bounds, and a generalization bound for the OutEffHop layer. Empirically, OutEffHop reduces outlier-related metrics (average kurtosis and maximum infinity norm) across BERT, OPT, ViT, and STanHop-Net, and complements existing clipping-based attention methods, with strong performance in both standard and quantized settings. The approach promises more robust, memory-efficient large-scale models, while acknowledging limitations related to LayerNorm-induced outliers and biases in training data.

Abstract

We introduce an Outlier-Efficient Modern Hopfield Model (termed $\mathrm{OutEffHop}$) and use it to address the outlier inefficiency problem of training gigantic transformer-based models. Our main contribution is a novel associative memory model facilitating outlier-efficient associative memory retrievals. Interestingly, this memory model manifests a model-based interpretation of an outlier-efficient attention mechanism (${\rm Softmax}_1$): it is an approximation of the memory retrieval process of $\mathrm{OutEffHop}$. Methodologically, this allows us to introduce novel outlier-efficient Hopfield layers as powerful alternatives to traditional attention mechanisms, with superior post-quantization performance. Theoretically, the Outlier-Efficient Modern Hopfield Model retains and improves the desirable properties of standard modern Hopfield models, including fixed point convergence and exponential storage capacity. Empirically, we demonstrate the efficacy of the proposed model across large-scale transformer-based and Hopfield-based models (including BERT, OPT, ViT, and STanHop-Net), benchmarking against state-of-the-art methods like $\mathtt{Clipped\_Softmax}$ and $\mathtt{Gated\_Attention}$. Notably, $\mathrm{OutEffHop}$ achieves an average reduction of 22+% in average kurtosis and 26+% in the maximum infinity norm of model outputs across four models. Code is available on GitHub (https://github.com/MAGICS-LAB/OutEffHop); models are on the Hugging Face Hub (https://huggingface.co/collections/magicslabnu/outeffhop-6610fcede8d2cda23009a98f); future updates are on arXiv (https://arxiv.org/abs/2404.03828).
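
The two outlier metrics quoted in the abstract can be illustrated with a minimal sketch. It assumes that "kurtosis" means the fourth standardized moment (non-excess, population form) and that the infinity norm is the largest absolute entry of an activation vector; the function names `kurtosis` and `infinity_norm` are hypothetical and are not taken from the authors' code.

```python
def kurtosis(values):
    """Fourth standardized moment of a flat list of floats.

    Assumption: population (biased) moments, non-excess form, so a
    Gaussian sample would give a value near 3.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    fourth = sum((v - mean) ** 4 for v in values) / n
    return fourth / var ** 2


def infinity_norm(values):
    """Largest absolute entry of the vector."""
    return max(abs(v) for v in values)


# Illustrative activations (made-up numbers): one large entry raises both metrics.
well_behaved = [1.0, -1.0, 1.0, -1.0]
with_outlier = [1.0, -1.0, 1.0, -1.0, 10.0]
print(kurtosis(well_behaved), infinity_norm(well_behaved))
print(kurtosis(with_outlier), infinity_norm(with_outlier))
```

Both quantities grow when a few entries dominate the tensor, which is why lower values are read as fewer outliers.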

Paper Structure

This paper contains 48 sections, 17 theorems, 89 equations, 8 figures, and 4 tables.

Key Result

Lemma 2.1

Let ${\rm{Softmax}}_1(z) \coloneqq \exp(z)/\left(\sum_{\mu=1}^{M} \exp(z_\mu) + 1\right)$ for any $z\in\mathbb{R}^M$, and let $t$ be the iteration number. The memory retrieval dynamics $\mathcal{T}_{\text{OutEff}}$ monotonically minimize the energy (eqn:H_energy) over $t$.
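
The ${\rm{Softmax}}_1$ definition in Lemma 2.1 can be sketched in plain Python as follows. The `softmax1` function follows the formula as defined (an extra 1 in the denominator). The `retrieve` function is an assumption: the lemma's update rule is not reproduced on this page, so the sketch uses the usual modern-Hopfield form, a ${\rm{Softmax}}_1$-weighted combination of stored patterns with inverse temperature `beta`. All function and parameter names are hypothetical.

```python
import math


def softmax1(z):
    """Softmax_1(z)_mu = exp(z_mu) / (1 + sum_nu exp(z_nu)), computed stably."""
    # Shift by m = max(max(z), 0) so that no exponent is positive.
    # After the shift, the "+1" term in the denominator becomes exp(-m).
    m = max(max(z), 0.0)
    exps = [math.exp(v - m) for v in z]
    denom = math.exp(-m) + sum(exps)
    return [e / denom for e in exps]


def retrieve(memories, query, beta=1.0):
    """One retrieval step (assumed form): sum_mu Softmax_1(beta * <xi_mu, x>)_mu * xi_mu.

    memories: list of M stored patterns, each a list of d floats.
    query:    list of d floats.
    """
    scores = [beta * sum(p_i * q_i for p_i, q_i in zip(p, query)) for p in memories]
    weights = softmax1(scores)
    d = len(query)
    return [sum(w * p[i] for w, p in zip(weights, memories)) for i in range(d)]


# Unlike the standard softmax, the weights may sum to less than 1,
# so a query that matches no stored pattern retrieves almost nothing.
print(sum(softmax1([0.0, 0.0])))
print(sum(softmax1([-50.0, -50.0])))
print(retrieve([[1.0, 0.0], [0.0, 1.0]], [10.0, 0.0]))
```

The weights summing to less than 1 is how this sketch reflects the "no-op" option described in the TL;DR: when every score is strongly negative, the output stays close to the zero vector instead of being forced onto some stored pattern.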

Figures (8)

  • Figure 1: Visualization of Outlier-Efficient Hopfield Model.
  • Figure 2: The Impact of $\mathtt{OutEffHop}$ on Maximum Infinity Norm $\|\mathbf{x}\|_{\infty}$ Changes During Pretraining of (a) BERT, (b) OPT, (c) ViT, and (d) STanHop-Net. The plots, from left to right, compare $\mathtt{OutEffHop}$ with the vanilla attention baseline and their combination with $\mathtt{Clipped\_Softmax}$ and $\mathtt{Gated\_Attention}$, following Bondarenko et al. (2023). The y-axis scale varies from figure to figure. For better visualization, we focus on the outlier reduction in layer 10 of the BERT, ViT, and OPT models, and in layer 9 of STanHop-Net. In all settings, $\mathtt{OutEffHop}$ significantly reduces $\|\mathbf{x}\|_{\infty}$ compared to vanilla attention and improves on $\mathtt{Clipped\_Softmax}$ and $\mathtt{Gated\_Attention}$.
  • Figure 3: Trend of the maximum infinity norm of the Feed-Forward Network (FFN) output in layers 3, 6, 9, and 10 of a BERT encoder, for two softmax variants: $\mathtt{OutEffHop}$ (red) and vanilla ${\rm{Softmax}}$ (grey). $\mathtt{OutEffHop}$ significantly reduces outliers in the model compared to the vanilla ${\rm{Softmax}}$.
  • Figure 4: Maximum infinity norm $\| \mathbf{x} \|_{\infty}$ for different tensor components within layer 10 of BERT, for two softmax variants: $\mathtt{OutEffHop}$ (red) and vanilla ${\rm{Softmax}}$ (grey). $\mathtt{OutEffHop}$ suppresses outlier growth in both FFN layers.
  • Figure 5: Memory Capacity. We evaluate memory capacity across various Hopfield networks, including Vanilla Modern Hopfield, Sparse Hopfield, 10th Order Hopfield, and our $\mathtt{OutEffHop}$, on two image datasets: MNIST and CIFAR10. $\mathtt{OutEffHop}$ outperforms its baselines, especially when the memory set size is large.
  • ...and 3 more figures

Theorems & Definitions (42)

  • Remark 2.1
  • Remark 2.2
  • Remark 2.3
  • Lemma 2.1: Retrieval Dynamics
  • proof : Proof Sketch
  • Remark 2.4
  • Remark 2.5
  • Definition 3.1: Storage and Retrieval
  • Theorem 3.1: Convergence of $\mathcal{T}_{\text{OutEff}}$
  • proof : Proof Sketch
  • ...and 32 more