Interactive Multi-Head Self-Attention with Linear Complexity

Hankyul Kang; Ming-Hsuan Yang; Jongbin Ryu

TL;DR

This work tackles the quadratic cost of cross-head interactions in multi-head self-attention (MHSA) by introducing iMHSA, which decomposes the attention operation into query- and key-less components and computes the resulting products in reverse order, enabling cross-head connectivity at linear-like complexity. Applying the cross-head interaction to the decomposed matrices gives an overall linear complexity of $O(2NLh(d+h))$, and memory stays low because only intermediate terms such as $\mathcal{A_K}^{\top} \mathcal{V}$ are computed and stored. The authors build a new Vision Transformer backbone, iViT, that uses iMHSA in its deeper stages, and report competitive or superior performance on image classification, object detection/segmentation, and semantic segmentation, with favorable compute/memory trade-offs relative to state-of-the-art transformers. Analyses and ablations indicate that cross-head interactions increase feature diversity and that adding heads with interaction continues to improve results, especially in high-token scenarios, suggesting a new direction for MHSA design in vision models.
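
The reverse-order computation mentioned in the summary can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the shapes (h heads, N tokens, reduced length L, d channels per head), the softmax normalizations, and the names `a_q`, `a_k`, `forward_order`, `reverse_order`, and `linear_cost` are all assumptions made for illustration. The sketch only shows that multiplying the key-side matrix with the values first gives the same result as forming the full N x N attention matrix, while storing an L x d intermediate instead.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def forward_order(a_q, a_k, v):
    """(A_Q A_K^T) V: materializes an (h, N, N) attention matrix."""
    attn = np.einsum("hnl,hml->hnm", a_q, a_k)
    return np.einsum("hnm,hmd->hnd", attn, v)


def reverse_order(a_q, a_k, v):
    """A_Q (A_K^T V): only an (h, L, d) intermediate is stored."""
    context = np.einsum("hnl,hnd->hld", a_k, v)
    return np.einsum("hnl,hld->hnd", a_q, context)


def linear_cost(n, l, h, d):
    """Operation count inside the bound quoted above: 2NLh(d + h)."""
    return 2 * n * l * h * (d + h)


# Hypothetical sizes, chosen only for the demonstration.
h, N, L, d = 4, 196, 49, 16
rng = np.random.default_rng(0)
a_q = softmax(rng.standard_normal((h, N, L)), axis=-1)  # assumed normalization
a_k = softmax(rng.standard_normal((h, N, L)), axis=1)   # assumed normalization
v = rng.standard_normal((h, N, d))

out_fwd = forward_order(a_q, a_k, v)
out_rev = reverse_order(a_q, a_k, v)
print(np.allclose(out_fwd, out_rev))
print(linear_cost(N, L, h, d), "vs quadratic", 2 * N * N * h * d)
```

Algebraically, $2NLh(d+h) = 2NLhd + 2NLh^2$. One plausible reading, not confirmed by the text above, is that the first term covers the two matrix products per head and the second covers head mixing on the two decomposed matrices.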

Abstract

We propose an efficient interactive method for multi-head self-attention via decomposition. For existing methods using multi-head self-attention, the attention operation of each head is computed independently. However, we show that the interactions between cross-heads of the attention matrix enhance the information flow of the attention operation. Considering that the attention matrix of each head can be seen as a feature of networks, it is beneficial to establish connectivity between them to capture interactions better. However, a straightforward approach to capture the interactions between the cross-heads is computationally prohibitive as the complexity grows substantially with the high dimension of an attention matrix. In this work, we propose an effective method to decompose the attention operation into query- and key-less components. This will result in a more manageable size for the attention matrix, specifically for the cross-head interactions. Extensive experimental results show that the proposed cross-head interaction approach performs favorably against existing efficient attention methods and state-of-the-art backbone models.
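
To make the cost argument in the abstract concrete, the following NumPy sketch mixes attention maps across heads with a single h x h weight matrix and counts multiply-adds for a full N x N map versus a smaller N x L map. The mixing operator, the weight matrix `w`, the sizes, and the helper names `cross_head_mix` and `mix_cost` are hypothetical stand-ins; the abstract does not specify how the interaction is parameterized.

```python
import numpy as np


def cross_head_mix(a, w):
    """Mix attention maps across heads. a: (h, rows, cols); w: (h_out, h)."""
    return np.einsum("gh,hnm->gnm", w, a)


def mix_cost(h, rows, cols):
    """Multiply-adds for one head-mixing pass over an (h, rows, cols) tensor."""
    return h * h * rows * cols


# Hypothetical sizes: h heads, N tokens, L reduced length.
h, N, L = 4, 196, 49
rng = np.random.default_rng(0)
w = rng.standard_normal((h, h))  # assumed learnable mixing weights

full_attn = rng.standard_normal((h, N, N))   # stand-in for full attention maps
decomposed = rng.standard_normal((h, N, L))  # stand-in for one decomposed matrix

mixed_full = cross_head_mix(full_attn, w)
mixed_small = cross_head_mix(decomposed, w)

# Mixing the full maps scales with N * N; mixing two decomposed
# matrices scales with 2 * N * L.
print(mixed_full.shape, mix_cost(h, N, N))
print(mixed_small.shape, 2 * mix_cost(h, N, L))
```

Under these assumptions, the interaction cost grows quadratically in the token count for the full attention matrix but only linearly once the matrices have N x L shape with L fixed.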

Paper Structure (37 sections, 8 equations, 8 figures, 10 tables)

Introduction
Related Work
Sparsity-based attention.
Kernel-based attention.
Low-rank-based attention.
Refined attention.
Method
Preliminaries
Multi-head self-attention
Cross-Head Interaction
Computational Complexity.
Memory Usage.
Interactive MHSA with Linear Complexity
Self-Attention Decomposition
Interactive MHSA with Linear Complexity
...and 22 more sections

Figures (8)

Figure 1: Experimental evaluation of our networks and SOTA networks. We compare the trade-off between computational complexity (FLOPs) and performance (top-1 accuracy, mAP, and more) on four tasks.
Figure 2: Schematic illustration of the (a) baseline MHSA, (b) MHSA with only Decomposition, (c) MHSA with only Interaction, and (d) our iMHSA with both Decomposition and Interaction.
Figure 3: Experimental comparisons of runtime and memory usage, measured in a single attention block using ViT-S. 'MHSA with the cross-head Interaction' enormously increases runtime and memory usage, whereas our Decomposition method requires minimal resources even with the Interaction. 'iMHSA' denotes our approach with both the Interaction and Decomposition methods. In (b), the difference between 'MHSA w/ Decomposition' and 'iMHSA' is very small.
Figure 4: Architectural design of the (a) iMHSA-based attention block, (b) iMHSA, and (c) cross-head interaction.
Figure 5: Visualization of the attention matrices of the original MHSA and our iMHSA method. We use the ViT-Tiny/8 model trained on the ImageNet-1K dataset.
...and 3 more figures
