Breaking the Low-Rank Dilemma of Linear Attention

Qihang Fan, Huaibo Huang, Ran He

TL;DR

This paper addresses the performance gap between Softmax attention and linear attention in vision tasks by identifying the low-rank nature of linear attention's output features as a core limitation. It introduces Rank-Augmented Linear Attention (RALA), which augments the rank of both the KV buffer and the output features while preserving linear computational complexity, and builds the Rank-Augmented Vision Linear Transformer (RAVLT) on top of it. Experiments on image classification, object detection/instance segmentation, and semantic segmentation show RAVLT matching or surpassing existing linear-attention models, reaching 84.4% Top-1 accuracy on ImageNet-1k with only 26M parameters and 4.6G FLOPs. The results suggest that targeted rank augmentation can make linear attention practical for high-performance vision transformers.

Abstract

The Softmax attention mechanism in Transformer models is notoriously computationally expensive, particularly due to its quadratic complexity, posing significant challenges in vision applications. In contrast, linear attention provides a far more efficient solution by reducing the complexity to linear levels. However, compared to Softmax attention, linear attention often experiences significant performance degradation. Our experiments indicate that this performance drop is due to the low-rank nature of linear attention's feature map, which hinders its ability to adequately model complex spatial information. In this paper, to break the low-rank dilemma of linear attention, we conduct rank analysis from two perspectives: the KV buffer and the output features. Consequently, we introduce Rank-Augmented Linear Attention (RALA), which rivals the performance of Softmax attention while maintaining linear complexity and high efficiency. Based on RALA, we construct the Rank-Augmented Vision Linear Transformer (RAVLT). Extensive experiments demonstrate that RAVLT achieves excellent performance across various vision tasks. Specifically, without using any additional labels, data, or supervision during training, RAVLT achieves an 84.4% Top-1 accuracy on ImageNet-1k with only 26M parameters and 4.6G FLOPs. This result significantly surpasses previous linear attention mechanisms, fully illustrating the potential of RALA. Code will be available at https://github.com/qhfan/RALA.
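
As a worked illustration of the two quantities the abstract analyzes, the following assumes the standard kernelized form of linear attention, with $\kappa$ denoting the kernel feature map as in the Figure 4 caption. The exact kernel and the symbols $S$, $z$, and $D$ are introduced here for illustration and are not taken from the paper.

$$
Y_i = \frac{\kappa(Q_i)\,\sum_{j=1}^{N}\kappa(K_j)^T V_j}{\kappa(Q_i)\,\sum_{j=1}^{N}\kappa(K_j)^T}, \qquad i = 1,\dots,N.
$$

Writing $S=\sum_{j=1}^{N}\kappa(K_j)^TV_j \in \mathbb{R}^{d\times d}$ for the KV buffer, $z=\sum_{j=1}^{N}\kappa(K_j)^T \in \mathbb{R}^{d}$, and $D=\mathrm{diag}(\kappa(Q_1)z,\dots,\kappa(Q_N)z)$, the output is

$$
Y = D^{-1}\,\kappa(Q)\,S, \qquad \mathrm{rank}(Y) \le \min\big(\mathrm{rank}(\kappa(Q)),\ \mathrm{rank}(S)\big) \le d.
$$

Under these assumptions, $S$ and $z$ are shared by all $N$ queries, so the cost is $O(Nd^2)$ rather than the $O(N^2d)$ of Softmax attention. The same sharing means that every output row passes through the single $d\times d$ matrix $S$, so a low-rank KV buffer directly caps the rank of the output features.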

Paper Structure

This paper contains 42 sections, 16 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Comparison of Softmax attention and linear attention. Linear attention has linear complexity and high efficiency, but its spatial modeling capability is inferior to Softmax attention.
  • Figure 2: Comparison of feature maps output by Softmax attention and different linear attentions. All experiments are conducted based on the DeiT-T architecture, with $N=196$ and $d=64$, so the full rank of the matrices in the figure is 64. Compared to Softmax attention, the output features of the various linear attentions have significantly lower rank. This indicates that the features learned by linear attention are less diverse than those learned by Softmax attention.
  • Figure 3: Comparison among models based on linear attention and Softmax attention. Our RAVLT achieves state-of-the-art results across all scales and significantly outperforms existing vision models based on linear attention.
  • Figure 4: Visualization of the rank analysis of the KV buffer for different linear attention mechanisms. The KV buffer is $\sum_{j=1}^{N}\kappa(K_j)^TV_j \in \mathbb{R}^{64\times 64}$.
  • Figure 5: Visualization of the output features' ($Y\in \mathbb{R}^{N\times d}$, $N=196$, $d=64$) rank analysis.
  • ...and 3 more figures
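
The quantities named in the captions of Figures 4 and 5 can be computed in miniature. The sketch below is a minimal NumPy illustration, not the authors' code: it assumes an ELU(x)+1 feature map as a stand-in for $\kappa$, uses random inputs instead of trained DeiT-T features, and all function names are hypothetical. It builds the $64\times 64$ KV buffer, computes the $196\times 64$ output, and checks that the linear-cost result matches the quadratic-cost computation with the same kernel.

```python
import numpy as np


def feature_map(x):
    # Assumed kernel: ELU(x) + 1, a common positive feature map.
    # The kappa used in the paper may differ.
    return np.where(x > 0, x + 1.0, np.exp(x))


def linear_attention(Q, K, V):
    """Kernelized linear attention; returns the output and the KV buffer."""
    phi_q, phi_k = feature_map(Q), feature_map(K)
    kv_buffer = phi_k.T @ V        # (d, d): sum_j kappa(K_j)^T V_j
    k_sum = phi_k.sum(axis=0)      # (d,):   sum_j kappa(K_j)
    numer = phi_q @ kv_buffer      # (N, d)
    denom = phi_q @ k_sum          # (N,): positive because the kernel is positive
    return numer / denom[:, None], kv_buffer


def quadratic_reference(Q, K, V):
    """Same attention computed through the explicit N x N weight matrix."""
    phi_q, phi_k = feature_map(Q), feature_map(K)
    scores = phi_q @ phi_k.T       # (N, N)
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights @ V


# Sizes from the figure captions: N = 196 tokens, d = 64 channels per head.
rng = np.random.default_rng(0)
N, d = 196, 64
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

Y, kv = linear_attention(Q, K, V)
Y_ref = quadratic_reference(Q, K, V)

# Numerical rank of the KV buffer and of the output features.
kv_rank = np.linalg.matrix_rank(kv)
y_rank = np.linalg.matrix_rank(Y)
print("KV buffer shape:", kv.shape, "rank:", kv_rank)
print("Output shape:", Y.shape, "rank:", y_rank)
print("Matches quadratic form:", np.allclose(Y, Y_ref))
```

Because the inputs here are random, the ranks printed by this sketch say nothing about the trained models in the figures. The point is only to show where the two measured matrices come from and that the output rank can never exceed $d$.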