Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks

Meng-Hao Guo, Zheng-Ning Liu, Tai-Jiang Mu, Shi-Min Hu

TL;DR

Self-attention in visual models has complexity quadratic in the number of positions and ignores correlations between samples. This work introduces external attention, which uses two shared memories $M_k$ and $M_v$ to compute an attention map $A$ and refine features with $F_{out} = A M_v$. Its complexity is $O(dSN)$, where $N$ is the number of positions, $d$ the feature dimension, and $S$ the memory size; because $d$ and $S$ are fixed hyperparameters, the cost is linear in $N$, and because the memories are shared across all samples, they can capture cross-sample context. Adding a multi-head mechanism yields EAMLP, an all-MLP architecture that matches or surpasses CNNs and Transformers on ImageNet while maintaining substantially lower compute. Across classification, detection, segmentation, generation, and 3D tasks, external attention delivers competitive or superior performance with notable savings in parameters and runtime.
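
To make the complexity comparison concrete, the following Python sketch counts leading-order multiply-accumulate operations for both mechanisms, with $N$ positions, feature dimension $d$, and memory size $S$. The function name `attention_cost` and the sizes used (a 64×64 feature map, $d = 512$, $S = 64$) are illustrative assumptions rather than values taken from this summary; projections and constant factors are ignored.

```python
def attention_cost(n_positions: int, dim: int, memory_size: int) -> dict:
    """Leading-order multiply-accumulate counts (hypothetical helper)."""
    # Self-attention: an N x N affinity map (N*N*d) plus the weighted
    # sum over values (N*N*d), so the cost grows as O(d * N^2).
    self_attn = 2 * n_positions * n_positions * dim
    # External attention: F M_k^T (N*S*d) plus A M_v (N*S*d),
    # so the cost grows as O(d * S * N).
    external = 2 * n_positions * memory_size * dim
    return {
        "self": self_attn,
        "external": external,
        "ratio": self_attn / external,  # simplifies to N / S
    }


# Assumed sizes for illustration only.
cost = attention_cost(n_positions=64 * 64, dim=512, memory_size=64)
print(cost)
```

Under these assumptions the ratio reduces to $N / S$, so the advantage of external attention grows with the resolution of the feature map.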

Abstract

Attention mechanisms, especially self-attention, have played an increasingly important role in deep feature representation for visual tasks. Self-attention updates the feature at each position by computing a weighted sum of features using pair-wise affinities across all positions to capture the long-range dependency within a single sample. However, self-attention has quadratic complexity and ignores potential correlation between different samples. This paper proposes a novel attention mechanism which we call external attention, based on two external, small, learnable, shared memories, which can be implemented easily by simply using two cascaded linear layers and two normalization layers; it conveniently replaces self-attention in existing popular architectures. External attention has linear complexity and implicitly considers the correlations between all data samples. We further incorporate the multi-head mechanism into external attention to provide an all-MLP architecture, external attention MLP (EAMLP), for image classification. Extensive experiments on image classification, object detection, semantic segmentation, instance segmentation, image generation, and point cloud analysis reveal that our method provides results comparable or superior to the self-attention mechanism and some of its variants, with much lower computational and memory costs.
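
The "two cascaded linear layers and two normalization layers" can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the authors' implementation: the function name `external_attention`, the tensor sizes, and the random initialization are hypothetical, and the choice of normalizations (a softmax over positions followed by an L1 normalization over memory slots) is an assumption, since the abstract does not say which two normalizations are used.

```python
import numpy as np


def external_attention(F, M_k, M_v):
    """Single-head external attention sketch.

    F:   (N, d) features of one sample.
    M_k: (S, d) key memory, shared by all samples.
    M_v: (S, d) value memory, shared by all samples.
    Returns the refined features (N, d) and the attention map (N, S).
    """
    # First linear layer (no bias): affinity of each position to each memory slot.
    logits = F @ M_k.T                                   # (N, S)
    # Assumed normalization 1: softmax over the position axis.
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    A = np.exp(logits)
    A = A / A.sum(axis=0, keepdims=True)
    # Assumed normalization 2: L1 normalization over the memory axis.
    A = A / A.sum(axis=1, keepdims=True)
    # Second linear layer (no bias): F_out = A M_v.
    return A @ M_v, A


rng = np.random.default_rng(0)
N, d, S = 196, 32, 8                      # hypothetical sizes
M_k = rng.normal(size=(S, d))
M_v = rng.normal(size=(S, d))

# The same memories are applied to different samples, which is how the
# mechanism can pick up correlations across the dataset during training.
F_a = 0.1 * rng.normal(size=(N, d))
F_b = 0.1 * rng.normal(size=(N, d))
out_a, A_a = external_attention(F_a, M_k, M_v)
out_b, A_b = external_attention(F_b, M_k, M_v)
print(out_a.shape, A_a.shape)
```

No $N \times N$ matrix is ever formed: the largest intermediate is the $N \times S$ attention map, which is where the linear cost in $N$ comes from.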

Paper Structure

This paper contains 20 sections, 6 equations, 6 figures, 12 tables, 2 algorithms.

Figures (6)

  • Figure 1: Self-attention versus external-attention.
  • Figure 2: Multi-head self-attention and multi-head external-attention.
  • Figure 3: EANet architecture for semantic segmentation using our proposed external attention.
  • Figure 4: Attention map and segmentation results on Pascal VOC test set. Left to right: input images, attention maps w.r.t. three selected entries in the external memory, segmentation results.
  • Figure 5: Multi-head attention maps in the last layer of EAMLP-14 on the ImageNet val set. Left: input image. Others: attention maps of the 24 heads. Last two rows: attention of two different rows of $M_k$ to the image patches.
  • ...and 1 more figures