Element-wise Attention Is All You Need

Guoxin Feng

TL;DR

A novel element-wise attention mechanism is proposed, which uses the element-wise squared Euclidean distance, instead of the dot product operation, to compute similarity and approximates the quadratic complexity term $\exp(q_{ic}k_{jc})$ with a Taylor polynomial.

Abstract

The self-attention (SA) mechanism has demonstrated superior performance across various domains, yet it suffers from substantial complexity during both training and inference. Next-generation architectures, which aim to retain the competitive performance of SA while achieving low-cost inference and efficient long-sequence training, primarily follow three approaches: linear attention, linear RNNs, and state space models. Although these approaches achieve lower complexity than SA, they all have built-in performance degradation factors, such as diminished “spikiness” and compression of historical information. In contrast to these approaches, we propose a novel element-wise attention mechanism, which uses the element-wise squared Euclidean distance, instead of the dot product operation, to compute similarity and approximates the quadratic complexity term $\exp(q_{ic}k_{jc})$ with a Taylor polynomial. This design achieves remarkable efficiency: during training, the element-wise attention has a complexity of $\mathcal{O}(tLD)$, making long-sequence training both computationally and memory efficient, where $L$ is the sequence length, $D$ is the feature dimension, and $t$ is the highest order of the polynomial; during inference, it can be reformulated as a recurrent neural network, achieving an inference complexity of $\mathcal{O}(tD)$. Furthermore, the element-wise attention circumvents the performance degradation factors present in these approaches and achieves performance comparable to SA in both causal and non-causal forms.
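To make the complexity claims concrete, here is a minimal pure-Python sketch for a single channel $c$. It is not the paper's PyTorch implementation. It assumes the similarity is the negative squared distance $-(q_i-k_j)^2$, so the softmax weight is proportional to $\exp(-k_j^2)\exp(2q_ik_j)$ once the factor $\exp(-q_i^2)$ cancels. The factor 2 is a consequence of that assumption, and the paper's own scaling may differ. The function names `ea_exact`, `ea_series`, and `ea_series_recurrent` are hypothetical.

```python
import math

def ea_exact(q, k, v):
    """Quadratic-cost reference: softmax over j of -(q_i - k_j)^2 (assumed sign)."""
    out = []
    for qi in q:
        w = [math.exp(-(qi - kj) ** 2) for kj in k]
        out.append(sum(wj * vj for wj, vj in zip(w, v)) / sum(w))
    return out

def _coefficients(t):
    # Taylor coefficients of exp(2x) up to order t: 2^n / n!
    return [2.0 ** n / math.factorial(n) for n in range(t + 1)]

def ea_series(q, k, v, t=6):
    """Non-causal approximation: exp(2*q*k) replaced by its order-t Taylor polynomial."""
    # Key-side sums are computed once, so the cost is O(t * L) per channel.
    num = [sum(kj ** n * math.exp(-kj * kj) * vj for kj, vj in zip(k, v))
           for n in range(t + 1)]
    den = [sum(kj ** n * math.exp(-kj * kj) for kj in k) for n in range(t + 1)]
    coef = _coefficients(t)
    out = []
    for qi in q:
        top = sum(coef[n] * qi ** n * num[n] for n in range(t + 1))
        bot = sum(coef[n] * qi ** n * den[n] for n in range(t + 1))
        out.append(top / bot)
    return out

def ea_series_recurrent(q, k, v, t=6):
    """Causal form as a recurrence: the state is 2 * (t + 1) running sums per channel."""
    coef = _coefficients(t)
    num = [0.0] * (t + 1)
    den = [0.0] * (t + 1)
    out = []
    for qi, ki, vi in zip(q, k, v):
        g = math.exp(-ki * ki)
        for n in range(t + 1):  # O(t) state update per step
            den[n] += ki ** n * g
            num[n] += ki ** n * g * vi
        top = sum(coef[n] * qi ** n * num[n] for n in range(t + 1))
        bot = sum(coef[n] * qi ** n * den[n] for n in range(t + 1))
        out.append(top / bot)
    return out

q = [0.1, -0.3, 0.5]
k = [0.2, -0.4, 0.3]
v = [1.0, 2.0, 3.0]
exact = ea_exact(q, k, v)
approx = ea_series(q, k, v, t=6)
causal = ea_series_recurrent(q, k, v, t=6)
```

In this sketch the per-step state does not grow with the sequence length, which is the sense in which the causal form behaves like a recurrent network. Repeating the computation over $D$ channels gives the $\mathcal{O}(tLD)$ and $\mathcal{O}(tD)$ figures quoted above.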

Paper Structure (12 sections, 23 equations, 5 figures, 4 tables)

This paper contains 12 sections, 23 equations, 5 figures, 4 tables.

Introduction
Notations
Element-wise Attention
EA
EA-series
Causal EA-series
Relation to and Differences from Previous Methods
Experiment
Performance Comparisons
Training Cost
Inference Cost
Conclusion

Figures (5)

Figure 1: Illustration of EA’s computation process. Specifically, we obtain the similarity scores $o_{ijc}\in\mathbb{R}$ by computing the squared Euclidean distances between the query element $q_{ic}\in\mathbb{R}$ and the key element $k_{jc}\in\mathbb{R}$. Subsequently, the Softmax function converts $o_{i:c}\in\mathbb{R}^L$ into weights, which are assigned to $v_{:c}\in\mathbb{R}^L$ to produce $y_{ic}\in\mathbb{R}$.
Figure 2: PyTorch implementation of EA-series.
Figure 3: An illustration of $e^x$ alongside its second- and sixth-order Taylor polynomials.
Figure 4: Training cost of EA-2, EA-6, and SA. Specifically, (a) illustrates their memory usage, (b) presents their BS-L curves, and (c) shows their throughput.
Figure 5: Inference cost of EA-2, EA-6, and SA. Specifically, (a) illustrates their memory usage, and (b) presents their latency.
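The comparison in Figure 3 can be reproduced numerically with a short sketch. The helper name `taylor_exp` and the sample points are illustrative assumptions, not taken from the paper. One property worth noting is that even-order Taylor polynomials of $e^x$ stay strictly positive, so they can stand in for unnormalized softmax weights without introducing negative values.

```python
import math

def taylor_exp(x, order):
    """Taylor polynomial of e^x around 0, truncated at the given order."""
    return sum(x ** n / math.factorial(n) for n in range(order + 1))

# Compare e^x with its second- and sixth-order polynomials at a few points.
points = [-3.0, -1.0, 0.0, 1.0, 2.0]
rows = [(x, math.exp(x), taylor_exp(x, 2), taylor_exp(x, 6)) for x in points]
for x, exact, p2, p6 in rows:
    print(f"x={x:+.1f}  exp={exact:.4f}  order2={p2:.4f}  order6={p6:.4f}")
```

Over these sample points the sixth-order polynomial tracks $e^x$ more closely than the second-order one, at the cost of more terms per evaluation.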
