Norm$\times$Direction: Restoring the Missing Query Norm in Vision Linear Attention

Weikang Meng, Yadan Luo, Liangyu Huo, Yingjian Li, Yaowei Wang, Xin Li, Zheng Zhang

TL;DR

NaLaFormer addresses the expressiveness gap of linear attention by reintroducing the query norm as a control on attention entropy and by preserving information through a cosine-direction non-negativity mechanism. The approach combines a query-norm-aware feature map with a cosine-based direction interaction under an ND decomposition, yielding a unified Norm×Direction linear attention. Empirically, it delivers state-of-the-art or competitive results across ImageNet, COCO, ADE20K, DIV2K SR, diffusion models, and language tasks, while achieving substantial memory and latency reductions in token-heavy settings. The work demonstrates broad applicability and practical impact for scalable, efficient transformers in vision and beyond.

Abstract

Linear attention mitigates the quadratic complexity of softmax attention but suffers from a critical loss of expressiveness. We identify two primary causes: (1) The normalization operation cancels the query norm, which breaks the correlation between a query's norm and the spikiness (entropy) of the attention distribution as in softmax attention. (2) Standard techniques for enforcing non-negativity cause destructive information loss by nullifying valid inner-product interactions. To address these challenges, we introduce NaLaFormer, a novel linear attention mechanism built upon a norm$\times$direction (ND) decomposition of the query and key vectors. We leverage each component to solve a distinct problem: The query norm is injected into our kernel to create a query-norm-aware map that restores the attention distribution's spikiness. The direction vectors are processed by a geometric, cosine-based similarity metric that guarantees non-negativity while preserving the rich, fine-grained information of the inner product. We validate NaLaFormer through a comprehensive multi-modal evaluation, where it sets new state-of-the-art benchmarks for linear attention. Our model achieves up to a 7.5% accuracy gain on ImageNet-1K and a 4.7% mIoU improvement on ADE20K over comparable baselines. It demonstrates profound efficiency, reducing peak memory by a transformative 92.3% in token-intensive super-resolution tasks (70K+ tokens). NaLaFormer's versatility is further confirmed as it surpasses strong baselines like Mamba on common-sense reasoning and sets a new state-of-the-art on the Long Range Arena (LRA) benchmark. Source code can be found in the supplementary materials.
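
The first cause named in the abstract can be checked numerically. The sketch below is illustrative only and is not the paper's code: it assumes a plain ReLU feature map for the linear-attention baseline, random keys, and hypothetical helper names (`softmax_weights`, `linear_weights`, `entropy`). Because ReLU is positively homogeneous, $\phi(c\,\mathbf{q}) = c\,\phi(\mathbf{q})$ for $c > 0$, so the factor $c$ appears in both the numerator and the normalizer of linear attention and cancels, while in softmax attention the same factor acts like an inverse temperature.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a probability vector."""
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def softmax_weights(q, K):
    """Attention weights of one query over all keys under softmax."""
    s = K @ q
    s = s - s.max()                  # shift for numerical stability
    w = np.exp(s)
    return w / w.sum()

def linear_weights(q, K, phi=lambda x: np.maximum(x, 0.0)):
    """Normalized kernel (linear) attention weights.
    Assumption: phi is ReLU, a common baseline, not the paper's feature map."""
    s = phi(K) @ phi(q)
    return s / s.sum()

rng = np.random.default_rng(0)
d, n = 16, 64
K = rng.normal(size=(n, d))          # random keys, for illustration only
q_dir = rng.normal(size=d)
q_dir /= np.linalg.norm(q_dir)       # unit-norm query direction

scales = [0.5, 1.0, 2.0, 4.0]        # query norms to compare
softmax_H = [entropy(softmax_weights(c * q_dir, K)) for c in scales]
linear_H = [entropy(linear_weights(c * q_dir, K)) for c in scales]

print("query norm :", scales)
print("softmax H  :", np.round(softmax_H, 4))
print("linear  H  :", np.round(linear_H, 4))
```

Under these assumptions the softmax entropy falls as the query norm grows, while the linear-attention entropy stays the same for every norm. This is the lost norm-entropy correlation that the query-norm-aware map is meant to restore.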

Paper Structure

This paper contains 26 sections, 2 theorems, 27 equations, 11 figures, 17 tables.

Key Result

theorem 1

Query Norm-aware Entropy Reduction in Softmax Attention. Let $x_i=\mathbf{qk}_i^\top$ be a positive sequence and let $\Phi:(-\infty,+\infty )\mapsto [0,+\infty )$ be a spiky function serving to reduce the PSE through mapping each $x_i$. In the case $\Phi(\cdot)=\operatorname{exp}(\

Figures (11)

  • Figure 1: Entropy-norm correlation in softmax attention. We plot the relationship between feature entropy and vector norms in a Swin Transformer sampled on ImageNet. The top row shows that $\mathbf{q}$-norms ($x$-axis) exhibit a strong negative correlation with attention entropy ($y$-axis). The bottom row shows that $\mathbf{k}$-norms have no consistent effect. This observation suggests that the entropy diminishing in linear attention may stem from insufficient query scaling, which points to query scaling as the key to restoring spikiness.
  • Figure 2: The NaLaFormer architecture and its core mechanisms. (a) The NaLaFormer block incorporates a simplified GLA and custom feature maps $\phi_q$ and $\phi_k$. (b) Our norm-aware method (right) restores the negative query norm-entropy correlation lost in standard linear attention (left). (c) The cosine direction mechanism enforces non-negativity by decomposing similarity into norm and direction components, preventing information loss.
  • Figure 3: We visualize the query norm-entropy relationship under three approaches: (1) preserving only non-negativity with the $1+\operatorname{ELU}$ operator [linearattn]; (2) keeping both non-negativity and spikiness with the $\operatorname{ReLU}$ operator and a power function, as in FLatten [flatten]; (3) our q-norm-aware approach, which shows a clear correlation between entropy and query norms.
  • Figure 4: Comparison of element-wise dot-product contributions under different non-negativity strategies. The plots show the $\mathbf{q}_i\mathbf{k}_i$ values for (1) the raw inputs, (2) our novel cosine-based approach, (3) $\operatorname{ReLU}$ activation, and (4) $1+\mathrm{ELU}$ activation. Our approach ensures all dimensional contributions are non-negative while retaining the fine-grained "spikiness" observed in the original product.
  • Figure 5: Visualizations illustrating NaLaFormer’s semantic segmentation results on the CityScapes dataset (left) and NaLaSR and ESRT’s super-resolution results on the Urban100 benchmark (right).
  • ...and 6 more figures
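
The comparison described for Figure 4 can be approximated with a small numerical sketch. It is not the paper's feature map: the angle mapping $\theta = \tfrac{\pi}{4}x$ and the helper `angle_features` are assumptions chosen only because they give a cosine-based similarity that is non-negative and still factorizes into separate query and key features, via $\cos(\theta_q-\theta_k)=\cos\theta_q\cos\theta_k+\sin\theta_q\sin\theta_k$. The inputs are random direction-like components in $[-1, 1]$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
q = rng.uniform(-1.0, 1.0, size=d)   # direction-like components in [-1, 1]
k = rng.uniform(-1.0, 1.0, size=d)

# (1) Raw per-dimension contributions: can be negative.
raw = q * k

# (3) ReLU on each side: zero whenever either component is negative,
#     including dimensions where both are negative and the raw product is positive.
relu = np.maximum(q, 0.0) * np.maximum(k, 0.0)

# (4) 1 + ELU on each side: strictly positive, but every dimension contributes.
def elu_plus_one(x):
    return np.where(x > 0, x + 1.0, np.exp(x))

elu = elu_plus_one(q) * elu_plus_one(k)

# (2) Assumed cosine construction: map each component to an angle in
#     [-pi/4, pi/4], so the angle difference stays within [-pi/2, pi/2]
#     and its cosine is non-negative.
def angle_features(x):
    theta = 0.25 * np.pi * x
    return np.cos(theta), np.sin(theta)

cq, sq = angle_features(q)
ck, sk = angle_features(k)
cosine = cq * ck + sq * sk           # equals cos(theta_q - theta_k) per dimension

both_negative = (q < 0) & (k < 0)
print("raw    :", np.round(raw, 3))
print("cosine :", np.round(cosine, 3))
print("relu   :", np.round(relu, 3))
print("1+elu  :", np.round(elu, 3))
print("dims zeroed by ReLU despite positive raw product:", int(both_negative.sum()))
```

In this sketch the cosine term is largest where the query and key components agree and smallest where they disagree, whereas ReLU discards every dimension in which either component is negative.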

Theorems & Definitions (6)

  • definition 1
  • definition 2
  • theorem 1
  • proof
  • theorem 2
  • proof