SAGA: Selective Adaptive Gating for Efficient and Expressive Linear Attention

Yuan Cao, Dong Wang

TL;DR

SAGA introduces input-adaptive learnable gates that selectively modulate information aggregation into the $KV$ feature map, enhancing semantic diversity and alleviating the low-rank constraint inherent in conventional linear attention.

Abstract

While the Transformer architecture excels at modeling long-range dependencies, which has contributed to its widespread adoption in vision tasks, the quadratic complexity of softmax-based attention imposes a major bottleneck, particularly when processing high-resolution images. Linear attention presents a promising alternative by reformulating the attention computation from $(QK)V$ to $Q(KV)$, thereby reducing the complexity from $\mathcal{O}(N^2)$ to $\mathcal{O}(N)$ while preserving the global receptive field. However, most existing methods compress historical key-value (KV) information uniformly, which can lead to feature redundancy and the loss of directional alignment with the query (Q). This uniform compression results in low-rank $KV$ feature maps, contributing to a performance gap compared to softmax attention. To mitigate this limitation, we propose Selective Adaptive GAting for Efficient and Expressive Linear Attention (SAGA), which introduces input-adaptive learnable gates to selectively modulate information aggregation into the $KV$ feature map. These gates enhance semantic diversity and alleviate the low-rank constraint inherent in conventional linear attention. Additionally, we propose an efficient Hadamard-product decomposition method for gate computation, which introduces no additional memory overhead. Experiments demonstrate that SAGA achieves a 1.76$\times$ improvement in throughput and a 2.69$\times$ reduction in peak GPU memory compared to PVT-T at a resolution of $1280 \times 1280$. Moreover, it improves top-1 accuracy by up to 4.4% on the ImageNet dataset, demonstrating both computational efficiency and model effectiveness.
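The reordering from $(QK)V$ to $Q(KV)$ described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the elementwise ReLU feature map, the helper names, and the toy sizes are assumptions chosen for illustration, and the usual attention normalization is omitted. The sketch only shows that once softmax is replaced by a feature map applied separately to $Q$ and $K$, matrix associativity lets the $d_k \times d_v$ product be formed first, so no $N \times N$ matrix is ever built.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def relu_map(m):
    """Hypothetical feature map: elementwise ReLU (an assumption of this sketch)."""
    return [[max(x, 0.0) for x in row] for row in m]

def quadratic_order(q, k, v):
    # (phi(Q) phi(K)^T) V: the intermediate matrix is N x N.
    scores = matmul(relu_map(q), transpose(relu_map(k)))
    return matmul(scores, v)

def linear_order(q, k, v):
    # phi(Q) (phi(K)^T V): the intermediate matrix is d_k x d_v, independent of N.
    kv = matmul(transpose(relu_map(k)), v)
    return matmul(relu_map(q), kv)

random.seed(0)
N, d_k, d_v = 6, 3, 4  # toy sizes, chosen only for illustration
Q = [[random.uniform(-1, 1) for _ in range(d_k)] for _ in range(N)]
K = [[random.uniform(-1, 1) for _ in range(d_k)] for _ in range(N)]
V = [[random.uniform(-1, 1) for _ in range(d_v)] for _ in range(N)]

out_quadratic = quadratic_order(Q, K, V)
out_linear = linear_order(Q, K, V)

# Largest elementwise gap between the two evaluation orders.
max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(out_quadratic, out_linear)
    for a, b in zip(row_a, row_b)
)
print(max_diff)
```

Both orders give the same output up to floating-point rounding. The difference is cost: the first order scales with $N^2$, the second with $N$ for fixed $d_k$ and $d_v$.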

Paper Structure

This paper contains 24 sections, 34 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Comparison of rank curves between $\mathrm{ReLU}(\cdot)$, Flatten Attention and our SAGA.
  • Figure 2: Computing $K^T V$ essentially sums $N$ matrices of size $d_k \times d_v$, which can be interpreted as fusing the information in the SFMs corresponding to the $N$ tokens. (a) Traditional linear attention aggregates all SFMs indiscriminately, which leads to substantial information redundancy. (b) We introduce a learnable gate layer for each SFM to refine the information flow entering the $KV$ feature map.
  • Figure 3: The overall architecture of SAGA.
  • Figure 4: AblationCAM of SAGA and Linear Attention.
  • Figure 5: Efficiency Comparison between LLFormer and SAGA. All reported data are measured using a single RTX 4090 GPU.
  • ...and 4 more figures
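
The Figure 2 caption describes $K^T V$ as a sum of $N$ per-token matrices of size $d_k \times d_v$, each passed through its own gate. The sketch below illustrates one way such gating could avoid extra memory. It is an assumption, not the paper's formulation: it supposes each gate factors as a rank-1 outer product $G_i = a_i b_i^T$, in which case $G_i \odot (k_i v_i^T) = (a_i \odot k_i)(b_i \odot v_i)^T$ and no $d_k \times d_v$ gate matrix needs to be stored. The names `A`, `B`, and both functions are hypothetical, and the gate values are random placeholders where a trained model would compute them from the input.

```python
import random

def outer(u, v):
    """Outer product of two vectors, returned as a list of rows."""
    return [[x * y for y in v] for x in u]

def add(a, b):
    """Elementwise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def hadamard(a, b):
    """Elementwise (Hadamard) product of two equally sized matrices."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

def gated_kv_naive(K, V, A, B):
    # Builds every d_k x d_v gate G_i = a_i b_i^T explicitly, then gates k_i v_i^T.
    kv = zeros(len(K[0]), len(V[0]))
    for k, v, a, b in zip(K, V, A, B):
        gate = outer(a, b)
        kv = add(kv, hadamard(gate, outer(k, v)))
    return kv

def gated_kv_factored(K, V, A, B):
    # Uses G_i * (k_i v_i^T) = (a_i * k_i)(b_i * v_i)^T, so no gate matrix is formed.
    kv = zeros(len(K[0]), len(V[0]))
    for k, v, a, b in zip(K, V, A, B):
        gated_k = [x * y for x, y in zip(a, k)]
        gated_v = [x * y for x, y in zip(b, v)]
        kv = add(kv, outer(gated_k, gated_v))
    return kv

random.seed(0)
N, d_k, d_v = 5, 3, 4  # toy sizes, chosen only for illustration
K = [[random.uniform(-1, 1) for _ in range(d_k)] for _ in range(N)]
V = [[random.uniform(-1, 1) for _ in range(d_v)] for _ in range(N)]
# Placeholder gate factors in [0, 1); hypothetical stand-ins for learned gates.
A = [[random.random() for _ in range(d_k)] for _ in range(N)]
B = [[random.random() for _ in range(d_v)] for _ in range(N)]

kv_naive = gated_kv_naive(K, V, A, B)
kv_factored = gated_kv_factored(K, V, A, B)

gap = max(
    abs(x - y)
    for row_x, row_y in zip(kv_naive, kv_factored)
    for x, y in zip(row_x, row_y)
)
print(gap)
```

Under the rank-1 assumption the two routes agree up to rounding, and the factored route stores only two gate vectors per token instead of a full $d_k \times d_v$ gate.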