Bridging the Divide: Reconsidering Softmax and Linear Attention

Dongchen Han; Yifan Pu; Zhuofan Xia; Yizeng Han; Xuran Pan; Xiu Li; Jiwen Lu; Shiji Song; Gao Huang

TL;DR

The paper confronts the efficiency gap between Softmax attention ($O(N^2)$) and linear attention ($O(N)$) in Vision Transformers. It identifies two pivotal properties—injectivity of the attention mapping and robust local modeling—as the main reasons for Softmax's superior performance and linear attention's limitations. The authors prove Softmax is injective under mild conditions while linear attention is not, and they introduce InLine, an injective linear attention, augmented with a local-bias residual to strengthen local modeling. Empirical validation across Swin, DeiT, PVT, and CSWin on ImageNet-1K, COCO, and ADE20K shows InLine variants can outperform Softmax baselines with comparable or lower FLOPs and improved throughput at larger window sizes. This work provides a principled route to scaling Vision Transformers to high-resolution inputs without sacrificing accuracy or excessively increasing computation.
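The non-injectivity claim can be illustrated with a small sketch. It assumes a ReLU kernel for linear attention and tiny hand-picked vectors; the function names (`softmax_weights`, `linear_weights`) are hypothetical and not taken from the paper's code. With a positively homogeneous kernel such as ReLU, scaling a query by a positive constant cancels in the normalization, so two different queries receive identical linear attention weights, whereas Softmax attention tells them apart.

```python
import math

def relu(v):
    # Assumed kernel function phi; the paper may use a different kernel.
    return [max(0.0, x) for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax_weights(q, keys):
    # Softmax attention weights of one query over all keys.
    scores = [dot(q, k) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def linear_weights(q, keys, phi=relu):
    # Linear attention weights: phi(q)·phi(k_j) / sum_i phi(q)·phi(k_i).
    sims = [dot(phi(q), phi(k)) for k in keys]
    z = sum(sims)
    return [s / z for s in sims]

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q1 = [1.0, 2.0]
q2 = [2.0, 4.0]  # same direction as q1, twice the length

lin1, lin2 = linear_weights(q1, keys), linear_weights(q2, keys)
soft1, soft2 = softmax_weights(q1, keys), softmax_weights(q2, keys)

print("linear :", lin1, lin2)    # both are [1/6, 1/3, 1/2]
print("softmax:", soft1, soft2)  # two different distributions
```

Here `q1` and `q2` are distinct queries, yet the linear attention map sends both to the same weight vector, which is the kind of collision an injective mapping rules out.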

Abstract

Widely adopted in modern Vision Transformer designs, Softmax attention can effectively capture long-range visual information; however, it incurs excessive computational cost when dealing with high-resolution inputs. In contrast, linear attention naturally enjoys linear complexity and has great potential to scale up to higher-resolution images. Nonetheless, the unsatisfactory performance of linear attention greatly limits its practical application in various scenarios. In this paper, we take a step forward to close the gap between linear and Softmax attention with novel theoretical analyses, which demystify the core factors behind the performance deviations. Specifically, we present two key perspectives to understand and alleviate the limitations of linear attention: the injective property and the local modeling ability. Firstly, we prove that linear attention is not injective, so it is prone to assigning identical attention weights to different query vectors, leading to severe semantic confusion since different queries correspond to the same outputs. Secondly, we confirm that effective local modeling is essential for the success of Softmax attention, and that linear attention falls short in this respect. These two fundamental differences contribute significantly to the disparities between the two attention paradigms, as demonstrated by the substantial empirical validation in the paper. In addition, further experimental results indicate that linear attention, once endowed with these two properties, can outperform Softmax attention across various tasks while maintaining lower computational complexity. Code is available at https://github.com/LeapLabTHU/InLine.
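The linear complexity mentioned above comes from reordering the matrix products: instead of forming an N×N similarity matrix, the keys and values are first summarized into a small d×d_v matrix. The following is a minimal sketch of that reordering, assuming a ReLU kernel and plain Python lists; the function names are hypothetical and this is not the authors' implementation.

```python
def relu(v):
    # Assumed kernel function phi.
    return [max(0.0, x) for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear_attention_quadratic(Q, K, V):
    # Reference form: every query is compared with every key, O(N^2) in tokens.
    d_v = len(V[0])
    out = []
    for q in Q:
        sims = [dot(relu(q), relu(k)) for k in K]
        z = sum(sims)
        out.append([sum(s * v[c] for s, v in zip(sims, V)) / z
                    for c in range(d_v)])
    return out

def linear_attention_linear(Q, K, V):
    # Reordered form: accumulate sum_j phi(k_j) v_j^T and sum_j phi(k_j) once,
    # then reuse them for every query, O(N) in tokens.
    d, d_v = len(K[0]), len(V[0])
    kv = [[0.0] * d_v for _ in range(d)]
    k_sum = [0.0] * d
    for k, v in zip(K, V):
        pk = relu(k)
        for a in range(d):
            k_sum[a] += pk[a]
            for c in range(d_v):
                kv[a][c] += pk[a] * v[c]
    out = []
    for q in Q:
        pq = relu(q)
        z = dot(pq, k_sum)
        out.append([sum(pq[a] * kv[a][c] for a in range(d)) / z
                    for c in range(d_v)])
    return out

Q = [[1.0, 2.0], [0.5, 1.5], [2.0, 0.1]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [3.0, 1.0, 0.0]]

out_quadratic = linear_attention_quadratic(Q, K, V)
out_linear = linear_attention_linear(Q, K, V)
```

Both functions compute the same outputs up to floating-point rounding; only the order of operations, and therefore the cost as the number of tokens grows, differs.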
