The Linear Attention Resurrection in Vision Transformer

Chuanyang Zheng

TL;DR

Softmax attention gives Vision Transformers $O(N^2C)$ complexity for $N$ tokens and $C$ channels, which hinders high-resolution vision tasks. The authors introduce an enhanced linear attention with a Local Concentration Module (LCM) and build it into L$^2$ViT, a hierarchical backbone that alternates Linear Global Attention (LGA) with Local Window Attention (LWA) to model both global and local context at $O(NC^2)$ cost, which is linear in $N$. Key contributions include the non-negative attention property, the LCM for local concentration, and the LGA/LWA architectural design. Together they reach 84.4% Top-1 accuracy on ImageNet-1K (87.0% after ImageNet-22k pre-training and fine-tuning at resolution $384^2$) and deliver favorable results on COCO object detection and ADE20K semantic segmentation. The results indicate that linear attention, when augmented with locality-focused modules, can rival or surpass softmax-based ViTs as a scalable, general-purpose vision backbone.

Abstract

Vision Transformers (ViTs) have recently taken computer vision by storm. However, the softmax attention underlying ViTs comes with a quadratic complexity in time and memory, hindering the application of ViTs to high-resolution images. We revisit the attention design and propose a linear attention method to address the limitation, which doesn't sacrifice ViT's core advantage of capturing global representation like existing methods (e.g. local window attention of Swin). We further investigate the key difference between linear attention and softmax attention. Our empirical results suggest that linear attention lacks a fundamental property of concentrating the distribution of the attention matrix. Inspired by this observation, we introduce a local concentration module to enhance linear attention. By incorporating enhanced linear global attention and local window attention, we propose a new ViT architecture, dubbed L$^2$ViT. Notably, L$^2$ViT can effectively capture both global interactions and local representations while enjoying linear computational complexity. Extensive experiments demonstrate the strong performance of L$^2$ViT. On image classification, L$^2$ViT achieves 84.4% Top-1 accuracy on ImageNet-1K without any extra training data or label. By further pre-training on ImageNet-22k, it attains 87.0% when fine-tuned with resolution 384$^2$. For downstream tasks, L$^2$ViT delivers favorable performance as a backbone on object detection as well as semantic segmentation.
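The complexity argument above can be illustrated with a minimal NumPy sketch. It assumes single-head attention on an `(N, C)` token matrix, ReLU as the feature map $\phi$ (the choice named in Figure 2), and a small `eps` added to the normalizer; the function names are illustrative and this is not the authors' L$^2$ViT code, and it omits the LCM. The sketch only shows that reordering the matrix products yields the same output without forming the $N \times N$ attention matrix.

```python
import numpy as np


def softmax_attention(q, k, v):
    """Standard attention: builds an (N, N) matrix, so cost grows as O(N^2 C)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def linear_attention_quadratic(q, k, v, eps=1e-6):
    """Reference form: phi(Q) phi(K)^T is computed explicitly, still O(N^2 C)."""
    phi_q, phi_k = np.maximum(q, 0), np.maximum(k, 0)  # ReLU keeps weights non-negative
    attn = phi_q @ phi_k.T                              # (N, N)
    attn = attn / (attn.sum(axis=-1, keepdims=True) + eps)
    return attn @ v


def linear_attention(q, k, v, eps=1e-6):
    """Same result via associativity: phi(Q) (phi(K)^T V), cost O(N C^2)."""
    phi_q, phi_k = np.maximum(q, 0), np.maximum(k, 0)
    kv = phi_k.T @ v                 # (C, C) summary of keys and values
    k_sum = phi_k.sum(axis=0)        # (C,) used for row normalization
    numer = phi_q @ kv               # (N, C)
    denom = phi_q @ k_sum + eps      # (N,)
    return numer / denom[:, None]


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 4)) for _ in range(3))
out_fast = linear_attention(q, k, v)
out_ref = linear_attention_quadratic(q, k, v)
out_softmax = softmax_attention(q, k, v)
```

Here `out_fast` and `out_ref` agree up to floating-point error, while `out_softmax` generally differs because softmax produces a more concentrated attention distribution, which is the gap the abstract says the local concentration module is meant to address.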

Paper Structure

This paper contains 27 sections, 7 equations, 5 figures, 13 tables.

Figures (5)

  • Figure 1: Grad-CAM (Selvaraju et al., 2017) activation maps of DeiT-Tiny (Touvron et al., 2021) equipped with different attention mechanisms, i.e., softmax attention, linear attention, and enhanced linear attention. The first row shows the original input images. Enhanced linear attention substantially reduces irrelevant distractions and focuses better on the object itself, such as the objects in the fifth and last columns.
  • Figure 2: The attention maps of softmax attention, linear attention using ReLU as $\phi$, and locally enhanced linear attention. The $x$ and $y$ axes indicate the patches. The deeper the network, the longer-range the dependencies the attention mechanism extracts.
  • Figure 3: Left: the overall architecture of our proposed L$^2$ViT. Right: the illustration of the Local Window Attention block (LWA) and Linear Global Attention block (LGA). MLP indicates the Multi-Layer Perceptron. WA indicates Window Attention, and LA indicates Linear Attention.
  • Figure 4: Comparison of different MLP layers: the plain MLP of Vaswani et al. (2017) (left, used by L$^2$ViT), the PVTv2 MLP (middle), and the locally improved MLP (right).
  • Figure 5: The PyTorch-style code for LCM.