Bridging the Divide: Reconsidering Softmax and Linear Attention

Dongchen Han, Yifan Pu, Zhuofan Xia, Yizeng Han, Xuran Pan, Xiu Li, Jiwen Lu, Shiji Song, Gao Huang

TL;DR

The paper confronts the efficiency gap between Softmax attention ($O(N^2)$) and linear attention ($O(N)$) in Vision Transformers. It identifies two pivotal properties—injectivity of the attention mapping and robust local modeling—as the main reasons for Softmax's superior performance and linear attention's limitations. The authors prove Softmax is injective under mild conditions while linear attention is not, and they introduce InLine, an injective linear attention, augmented with a local-bias residual to strengthen local modeling. Empirical validation across Swin, DeiT, PVT, and CSWin on ImageNet-1K, COCO, and ADE20K shows InLine variants can outperform Softmax baselines with comparable or lower FLOPs and improved throughput at larger window sizes. This work provides a principled route to scaling Vision Transformers to high-resolution inputs without sacrificing accuracy or excessively increasing computation.
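
The complexity difference mentioned above comes from the order of matrix products. The following is a minimal NumPy sketch, not the authors' implementation: it assumes single-head attention, a hypothetical kernel `phi_relu` (ReLU plus a small constant to keep the normalizer positive), and illustrative function names. It shows that linear attention can be computed without ever forming the $N \times N$ attention matrix.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Forms the full N x N score matrix: O(N^2 d) time, O(N^2) memory.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention_naive(Q, K, V, phi):
    # Linear attention written the "quadratic" way, for reference only.
    sim = phi(Q) @ phi(K).T                        # (N, N)
    weights = sim / sim.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi):
    # Same result, reordered: phi(K)^T V is only (d, d_v), so cost is O(N d^2).
    q, k = phi(Q), phi(K)
    kv = k.T @ V                                   # (d, d_v)
    normalizer = q @ k.sum(axis=0)                 # (N,)
    return (q @ kv) / normalizer[:, None]

def phi_relu(x, eps=1e-6):
    # Assumed kernel: ReLU plus a small constant (hypothetical choice).
    return np.maximum(x, 0.0) + eps

rng = np.random.default_rng(0)
N, d = 64, 8
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out_fast = linear_attention(Q, K, V, phi_relu)
out_naive = linear_attention_naive(Q, K, V, phi_relu)
print(np.allclose(out_fast, out_naive))
```

Both linear variants return the same output; only the reordered one avoids the quadratic intermediate, which is what makes larger windows and higher resolutions affordable.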

Abstract

Widely adopted in modern Vision Transformer designs, Softmax attention can effectively capture long-range visual information; however, it incurs excessive computational cost when dealing with high-resolution inputs. In contrast, linear attention naturally enjoys linear complexity and has great potential to scale up to higher-resolution images. Nonetheless, the unsatisfactory performance of linear attention greatly limits its practical application in various scenarios. In this paper, we take a step forward to close the gap between linear and Softmax attention with novel theoretical analyses, which demystify the core factors behind the performance gap. Specifically, we present two key perspectives to understand and alleviate the limitations of linear attention: the injective property and the local modeling ability. Firstly, we prove that linear attention is not injective: it is prone to assigning identical attention weights to different query vectors, which causes severe semantic confusion since different queries correspond to the same outputs. Secondly, we confirm that effective local modeling is essential for the success of Softmax attention, and that linear attention falls short in this respect. These two fundamental differences contribute significantly to the disparities between the two attention paradigms, as demonstrated by substantial empirical validation in the paper. In addition, further experimental results indicate that linear attention, once endowed with these two properties, can outperform Softmax attention across various tasks while maintaining lower computational complexity. Code is available at https://github.com/LeapLabTHU/InLine.
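
The non-injectivity claim can be checked numerically on a toy case. The sketch below is illustrative only and assumes a ReLU kernel with division-based normalization; the keys, queries, and function names are made up for the example. Two collinear queries of different lengths receive identical linear attention weights, while Softmax attention tells them apart.

```python
import numpy as np

def linear_attention_weights(q, K, phi):
    # Attention distribution of one query over all keys (division-normalized).
    sim = phi(K) @ phi(q)          # (N,)
    return sim / sim.sum()

def softmax_attention_weights(q, K):
    scores = K @ q
    scores = scores - scores.max()  # numerical stability
    w = np.exp(scores)
    return w / w.sum()

def relu(x):
    return np.maximum(x, 0.0)

# Toy keys and two collinear queries (q2 is q1 scaled by 3); values are arbitrary.
K = np.array([[1.0, 0.5],
              [0.2, 1.0],
              [0.7, 0.7]])
q1 = np.array([1.0, 2.0])
q2 = 3.0 * q1

lin1 = linear_attention_weights(q1, K, relu)
lin2 = linear_attention_weights(q2, K, relu)
soft1 = softmax_attention_weights(q1, K)
soft2 = softmax_attention_weights(q2, K)

print("linear attention identical:", np.allclose(lin1, lin2))     # True
print("softmax attention identical:", np.allclose(soft1, soft2))  # False
```

The scale factor cancels in the linear case because ReLU is positively homogeneous, so the numerator and the normalizer are multiplied by the same constant. In Softmax attention the scale changes the sharpness of the distribution, so the two queries stay distinguishable.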

Paper Structure

This paper contains 20 sections, 10 equations, 5 figures, 17 tables.

Figures (5)

  • Figure 1: An illustration of the injective property and the confusion problem. Non-injectivity leads to various semantic confusions in linear attention when different kernel functions are employed. (a) With $\phi(\cdot)=\mathrm{ReLU}(\cdot)$, linear attention assigns the same attention values to collinear queries of varying lengths. (b) With $\phi(\cdot)=\mathrm{ReLU}(A\cdot+b)$, linear attention faces a severe confusion problem, producing identical attention distributions for certain queries with different directions and lengths.
  • Figure 2: The distribution of the number of times each image encounters confusion during inference.
  • Figure 3: The sum of attention scores in the local $3\!\times\!3$ windows of each query from DeiT-T.
  • Figure 4: Visualizations of attention distributions. Softmax attention exhibits strong local bias. The other two attention types yield meaningful attention distributions, but focus more on global modeling.
  • Figure 5: Speed measurements. Runtime and FPS are tested on an RTX 3090 GPU. (a) Accuracy-Runtime curve on ImageNet. (b) Increasing window size. (c) High-resolution scenarios.
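
The quantity plotted in Figure 3, the attention mass each query places on its local $3\times3$ window, could be measured along the following lines. This is a hedged sketch rather than the authors' evaluation code: it assumes a row-stochastic attention matrix over an $H \times W$ token grid in row-major order with any class token already removed, and the function name `local_attention_mass` is hypothetical.

```python
import numpy as np

def local_attention_mass(attn, H, W, radius=1):
    """Sum of attention each query assigns to keys within its local window.

    attn: (N, N) attention matrix with N = H * W tokens in row-major order
          (assumed layout; class token excluded).
    radius=1 corresponds to a 3x3 window around each query.
    """
    N = H * W
    assert attn.shape == (N, N)
    rows, cols = np.divmod(np.arange(N), W)
    # Key j is "local" to query i if both grid offsets are within the radius.
    d_row = np.abs(rows[:, None] - rows[None, :])
    d_col = np.abs(cols[:, None] - cols[None, :])
    mask = (d_row <= radius) & (d_col <= radius)
    return (attn * mask).sum(axis=-1)          # (N,)

# Sanity check with uniform attention on a 4x4 grid.
H, W = 4, 4
uniform = np.full((H * W, H * W), 1.0 / (H * W))
mass = local_attention_mass(uniform, H, W)
print(mass[0], mass[5])  # corner query: 4/16, interior query: 9/16
```

Uniform attention gives the baseline that a purely global pattern would produce. Values well above it indicate the strong local bias that the paper attributes to Softmax attention.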

Theorems & Definitions (4)

  • proof
  • proof
  • proof
  • proof