MirrorLA: Reflecting Feature Map for Vision Linear Attention

Weikang Meng, Liangyu Huo, Yadan Luo, Yaowei Wang, Yingjian Li, Zheng Zhang

TL;DR

This work tackles the gap between linear attention and softmax-based transformers by addressing the information loss caused by non-negative kernel feature maps. It introduces MirrorLA, a geometry-aware framework that uses learnable Householder reflections to actively reorient feature maps into the non-negative orthant, preserving information while maintaining linear time complexity. The method combines block-wise isometries, variance-aware angle modulation, and cross-head reflections to enhance local discriminability, long-context diversity, and global head interaction. Across diverse vision tasks, MirrorLA achieves state-of-the-art results with reduced memory and latency, demonstrating that linear efficiency can coexist with high representational fidelity.

Abstract

Linear attention significantly reduces the computational complexity of Transformers from quadratic to linear, yet it consistently lags behind softmax-based attention in performance. We identify the root cause of this degradation as the non-negativity constraint imposed on kernel feature maps: standard projections like ReLU act as "passive truncation" operators, indiscriminately discarding semantic information residing in the negative domain. We propose MirrorLA, a geometric framework that substitutes passive truncation with active reorientation. By leveraging learnable Householder reflections, MirrorLA rotates the feature geometry into the non-negative orthant to maximize information retention. Our approach restores representational density through a cohesive, multi-scale design: it first optimizes local discriminability via block-wise isometries, stabilizes long-context dynamics using variance-aware modulation to diversify activations, and finally, integrates dispersed subspaces via cross-head reflections to induce global covariance mixing. MirrorLA achieves state-of-the-art performance across standard benchmarks, demonstrating that strictly linear efficiency can be achieved without compromising representational fidelity.
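The contrast between passive truncation and active reorientation can be illustrated with a small sketch. It is not the authors' implementation: the reflection vector is hand-picked rather than learned, and the helper names (`householder`, `linear_attention`), the sum-normalization, and the `eps` constant are assumptions made for illustration. The attention is computed in the usual kernel form, accumulating $\sum_j \phi(\mathbf{k}_j)\mathbf{v}_j^\top$ once so that the cost grows linearly with the number of tokens.

```python
def householder(v):
    """Return H = I - 2 v v^T / (v^T v) as a list of rows."""
    norm_sq = sum(x * x for x in v)
    n = len(v)
    return [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / norm_sq
             for j in range(n)] for i in range(n)]

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def relu(x):
    """Element-wise ReLU: negative coordinates are set to zero."""
    return [max(0.0, xi) for xi in x]

def linear_attention(Q, K, V, phi, eps=1e-6):
    """Kernel linear attention with feature map phi (eps is an assumption)."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # running sum of phi(k_j) v_j^T
    z = [0.0] * d                       # running sum of phi(k_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d)) + eps
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / denom
                    for b in range(dv)])
    return out

# Toy data whose keys lie entirely in the negative quadrant.
Q = [[-1.0, -2.0], [0.5, -1.0]]
K = [[-1.0, -0.5], [-2.0, -1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]

# Hand-picked reflection vector (learned in the paper, fixed here).
H = householder([1.0, 1.0])

passive = linear_attention(Q, K, V, relu)                         # phi(x) = ReLU(x)
active = linear_attention(Q, K, V, lambda x: relu(matvec(H, x)))  # phi(x) = ReLU(Hx)
```

In this toy case every key is zeroed by plain ReLU, so `passive` carries no information about `V`, whereas `active` returns a proper weighted average of the value rows because the reflection maps the keys into the non-negative quadrant first.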

Paper Structure

This paper contains 21 sections, 1 theorem, 18 equations, 5 figures, 9 tables, 1 algorithm.

Key Result

Theorem 1

Collapse-aware Activation Diversification. Let $\mathcal{X}=\{\mathbf{x}_{t,m}\}_{t=1}^{L}\subset\mathbb{R}^{2}$ be the token features of a 2D block with empirical variance $\sigma^{2}$ computed over the token dimension. Consider the modulated mirror map where $\Delta\alpha(\sigma^2)=\mathrm{sigmoid}(1/(\sigma^2+\varepsilon))\cdot\alpha$ and $\mathbf{H}(\Theta_m)$ denotes a 2D Househ
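The modulation term $\Delta\alpha(\sigma^2)$ stated above can be evaluated directly to see how it reacts to the block variance. The following is a minimal sketch of that one formula only; the value of `alpha`, the default `eps`, and the list of variances are hypothetical, and how $\Delta\alpha$ enters the reflection is not shown because the statement is truncated in this excerpt.

```python
import math

def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def delta_alpha(sigma_sq, alpha, eps=1e-6):
    """Delta-alpha(sigma^2) = sigmoid(1 / (sigma^2 + eps)) * alpha."""
    return sigmoid(1.0 / (sigma_sq + eps)) * alpha

alpha = math.pi / 4                      # hypothetical value of alpha
variances = [0.0, 0.1, 1.0, 10.0, 1e6]   # hypothetical block variances
shifts = [delta_alpha(s, alpha) for s in variances]
```

For positive `alpha` the shift decreases monotonically as the variance grows: it approaches `alpha` when the block variance collapses toward zero and approaches `alpha / 2` when the variance is very large, since the sigmoid argument then tends to zero.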

Figures (5)

  • Figure 1: PCA Visualization of Feature Topology. We visualize $\operatorname{Softmax}(\phi(\mathbf{Q})\phi(\mathbf{K})^\top)$ under constant normalization across three paradigms: (a) Vanilla Attention, $\phi(\mathbf{x})=\mathbf{x}$ or $\mathbf{Hx}$; (b) Passive Truncation, $\phi(\mathbf{x})=\operatorname{ReLU}(\mathbf{x})$, which results in "dead" dimensions and information loss; (c) Active Reorientation, $\phi(\mathbf{x})=\operatorname{ReLU}(\mathbf{Hx})$, which employs an isometric Householder reflection. This aligns informative features with the positive orthant, recovering the rich structural details lost in (b).
  • Figure 2: Overview of the MirrorLA framework. (a1) In 2D, a learnable Householder reflection $\mathbf{H}$ reorients the features ($\mathbf{Hq}$, $\mathbf{Hk}$) before applying an axis-aligned non-negativity map, converting passive truncation into active alignment while preserving inner products by isometry. (a2) Extension to $D$ dimensions via block-wise reflection with low overhead. (b) Adaptive adjustment of reflection angles based on block variance $\sigma^2$ to enhance feature diversity. (c) A global transformation before head-wise decomposition to encourage inter-head communication.
  • Figure 3: Visualization of Semantic Segmentation and Super-Resolution (SR) tasks. Left: Comparative results on the Cityscapes dataset, where MirrorLA achieves superior segmentation integrity, whereas competing methods suffer from incomplete masks. Right: Comparison of SR performance; MirrorLA reconstructs details more effectively, while DCTLSA introduces noticeable structural distortions.
  • Figure 4: More visualizations on super-resolution tasks. Our model produces clearer boundaries, more accurate shapes, and finer-grained textures than the baselines.
  • Figure 5: Visualization results on semantic segmentation. Compared to other methods, our approach delineates different objects more accurately, effectively avoiding missing predictions and boundary ambiguities.
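The isometry property mentioned in the Figure 2 caption (inner products are preserved when the same reflection is applied to both query and key) can be checked numerically. The sketch below assumes one reflection angle per 2D block, with the block's unit normal given by $(\cos\theta, \sin\theta)$; this parameterization, the function names, and the sample vectors are illustrative assumptions rather than the paper's exact construction.

```python
import math

def reflect_block(x0, x1, theta):
    """Apply the 2D Householder reflection I - 2 n n^T with n = (cos theta, sin theta)."""
    c, s = math.cos(theta), math.sin(theta)
    proj = x0 * c + x1 * s              # component of x along the unit normal n
    return x0 - 2.0 * proj * c, x1 - 2.0 * proj * s

def blockwise_reflect(x, thetas):
    """Reflect each consecutive pair of coordinates with its own angle."""
    out = []
    for m, theta in enumerate(thetas):
        out.extend(reflect_block(x[2 * m], x[2 * m + 1], theta))
    return out

def dot(a, b):
    return sum(p * r for p, r in zip(a, b))

# Hypothetical 4-dimensional query and key, split into two 2D blocks.
q = [1.0, -2.0, 0.5, 3.0]
k = [-0.5, 1.5, 2.0, -1.0]
thetas = [0.3, 1.1]                     # one angle per block (D / 2 parameters)

hq = blockwise_reflect(q, thetas)
hk = blockwise_reflect(k, thetas)
inner_before = dot(q, k)
inner_after = dot(hq, hk)               # equal to inner_before up to rounding
```

Under this assumed parameterization a $D$-dimensional feature needs only $D/2$ angles and a linear number of multiply-adds, which is one way to read the "low overhead" remark in panel (a2).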

Theorems & Definitions (1)

  • Theorem 1