Element-wise Attention Is All You Need

Guoxin Feng

TL;DR

A novel element-wise attention mechanism is proposed, which uses the element-wise squared Euclidean distance, instead of the dot product operation, to compute similarity and approximates the quadratic complexity term $\exp(q_{ic}k_{jc})$ with a Taylor polynomial.

Abstract

The self-attention (SA) mechanism has demonstrated superior performance across various domains, yet it suffers from substantial complexity during both training and inference. Next-generation architectures, which aim to retain the competitive performance of SA while achieving low-cost inference and efficient long-sequence training, primarily follow three approaches: linear attention, linear RNNs, and state space models. Although these approaches achieve lower complexity than SA, they all have built-in performance degradation factors, such as diminished “spikiness” and compression of historical information. In contrast to these approaches, we propose a novel element-wise attention mechanism, which uses the element-wise squared Euclidean distance, instead of the dot product operation, to compute similarity and approximates the quadratic complexity term $\exp(q_{ic}k_{jc})$ with a Taylor polynomial. This design achieves remarkable efficiency: during training, the element-wise attention has a complexity of $\mathcal{O}(tLD)$, making long-sequence training both computationally and memory efficient, where $L$ is the sequence length, $D$ is the feature dimension, and $t$ is the highest order of the polynomial; during inference, it can be reformulated as a recurrent neural network, achieving an inference complexity of $\mathcal{O}(tD)$. Furthermore, the element-wise attention circumvents the performance degradation factors present in these approaches and achieves performance comparable to SA in both causal and non-causal forms.
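
The complexity claims in the abstract can be made concrete with a short derivation. This is a sketch rather than the paper's own equations: it assumes the similarity score is the unscaled negative squared distance $o_{ijc} = -(q_{ic}-k_{jc})^2$, so the constant in the exponent (here $2$) may differ from the paper's $\exp(q_{ic}k_{jc})$ depending on how the distance is scaled.

$$
\exp\!\big(-(q_{ic}-k_{jc})^2\big) = e^{-q_{ic}^2}\, e^{-k_{jc}^2}\, e^{2 q_{ic} k_{jc}},
\qquad
e^{2 q_{ic} k_{jc}} \approx \sum_{n=0}^{t} \frac{2^n}{n!}\, q_{ic}^{\,n}\, k_{jc}^{\,n}
$$

The factor $e^{-q_{ic}^2}$ does not depend on $j$ and cancels in the Softmax, which leaves

$$
y_{ic} \approx
\frac{\sum_{n=0}^{t} \frac{2^n}{n!}\, q_{ic}^{\,n} \sum_{j} e^{-k_{jc}^2}\, k_{jc}^{\,n}\, v_{jc}}
     {\sum_{n=0}^{t} \frac{2^n}{n!}\, q_{ic}^{\,n} \sum_{j} e^{-k_{jc}^2}\, k_{jc}^{\,n}}
$$

The inner sums over $j$ no longer depend on $i$, so they are computed once per order $n$ and channel $c$, giving $\mathcal{O}(tLD)$ work in total. In the causal case the sums run over $j \le i$ and can be kept as running totals: the state is $2(t+1)D$ numbers, and each new token costs $\mathcal{O}(tD)$ to update and read out.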

Paper Structure

This paper contains 12 sections, 23 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Illustration of the computation process of element-wise attention (EA). Specifically, we obtain the similarity scores $o_{ijc}\in\mathbb{R}$ by computing the squared Euclidean distances between the query element $q_{ic}\in\mathbb{R}$ and the key element $k_{jc}\in\mathbb{R}$. Subsequently, the Softmax function converts $o_{i:c}\in\mathbb{R}^L$ into weights, which are assigned to $v_{:c}\in\mathbb{R}^L$ to produce $y_{ic}\in\mathbb{R}$.
  • Figure 2: PyTorch implementation of EA-series.
  • Figure 3: An illustration of $e^x$ alongside its second- and sixth-order Taylor polynomials.
  • Figure 4: Training cost of EA-2, EA-6, and SA. Specifically, (a) illustrates their memory usage, (b) presents their BS-L curves, and (c) shows their throughput.
  • Figure 5: Inference cost of EA-2, EA-6, and SA. Specifically, (a) illustrates their memory usage, and (b) presents their latency.
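
The computation described in the Figure 1 caption, and the Taylor-based variants named EA-2 and EA-6 in Figures 4 and 5, can be illustrated with a small NumPy sketch. This is not the PyTorch code of Figure 2, which is not reproduced here. It assumes the score is the unscaled negative squared distance $-(q_{ic}-k_{jc})^2$ (the paper's sign and scaling may differ), and the function names `ea_exact`, `ea_taylor`, and `ea_recurrent_step` are hypothetical. The linear versions rely on $\exp(-(q-k)^2) = e^{-q^2} e^{-k^2} e^{2qk}$, where $e^{-q^2}$ cancels in the Softmax and $e^{2qk}$ is replaced by its order-$t$ Taylor polynomial.

```python
import math
import numpy as np

def ea_exact(q, k, v):
    """Quadratic-cost reference: per channel c, softmax over j of -(q_ic - k_jc)^2."""
    # q, k, v: arrays of shape (L, D)
    o = -(q[:, None, :] - k[None, :, :]) ** 2      # (L, L, D) scores o_ijc (assumed form)
    o = o - o.max(axis=1, keepdims=True)           # stabilise the softmax over j
    w = np.exp(o)
    w = w / w.sum(axis=1, keepdims=True)           # weights over j sum to 1
    return (w * v[None, :, :]).sum(axis=1)         # (L, D)

def ea_taylor(q, k, v, t=6):
    """Non-causal O(tLD) approximation: exp(2qk) replaced by an order-t Taylor polynomial."""
    g = np.exp(-k ** 2)                            # (L, D) key-only factor
    num = np.zeros_like(q)
    den = np.zeros_like(q)
    for n in range(t + 1):
        coeff = 2.0 ** n / math.factorial(n)
        kn = g * k ** n                            # (L, D)
        s_num = (kn * v).sum(axis=0)               # (D,) sum over j, independent of i
        s_den = kn.sum(axis=0)                     # (D,)
        qn = coeff * q ** n                        # (L, D)
        num += qn * s_num
        den += qn * s_den
    return num / den

def ea_recurrent_step(state, q_i, k_i, v_i, t=6):
    """One causal decoding step with O(tD) work; state has shape (2, t+1, D)."""
    n = np.arange(t + 1)[:, None]                  # (t+1, 1) polynomial orders
    kn = np.exp(-k_i ** 2) * k_i[None, :] ** n     # (t+1, D)
    state = state + np.stack([kn * v_i, kn])       # running numerator / denominator sums
    coeff = np.array([2.0 ** m / math.factorial(m) for m in range(t + 1)])[:, None]
    qn = coeff * q_i[None, :] ** n                 # (t+1, D)
    y_i = (qn * state[0]).sum(axis=0) / (qn * state[1]).sum(axis=0)
    return state, y_i

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.uniform(-0.5, 0.5, size=(8, 4)) for _ in range(3))
    err = np.abs(ea_exact(q, k, v) - ea_taylor(q, k, v, t=6)).max()
    print("max abs error of the order-6 approximation:", err)
```

The approximation is only as good as the Taylor polynomial over the range of $2 q_{ic} k_{jc}$, which is why the demo draws small-magnitude inputs. Even orders such as 2 and 6 have the convenient property that the Taylor polynomial of $e^x$ stays positive for all real $x$, so the denominator cannot vanish.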