MirrorLA: Reflecting Feature Map for Vision Linear Attention
Weikang Meng, Liangyu Huo, Yadan Luo, Yaowei Wang, Yingjian Li, Zheng Zhang
TL;DR
This work tackles the gap between linear attention and softmax-based transformers by addressing the information loss caused by non-negative kernel feature maps. It introduces MirrorLA, a geometry-aware framework that uses learnable Householder reflections to actively rotate feature maps into the non-negative orthant, preserving information while maintaining linear time complexity. The method combines block-wise isometries, variance-aware angle modulation, and cross-head reflections to enhance local discriminability, long-context diversity, and global head interaction. Across diverse vision tasks, MirrorLA achieves state-of-the-art results with reduced memory and latency, demonstrating that linear efficiency can coexist with high representational fidelity.
Abstract
Linear attention significantly reduces the computational complexity of Transformers from quadratic to linear, yet it consistently lags behind softmax-based attention in performance. We identify the root cause of this degradation as the non-negativity constraint imposed on kernel feature maps: standard projections like ReLU act as "passive truncation" operators, indiscriminately discarding semantic information residing in the negative domain. We propose MirrorLA, a geometric framework that substitutes passive truncation with active reorientation. By leveraging learnable Householder reflections, MirrorLA rotates the feature geometry into the non-negative orthant to maximize information retention. Our approach restores representational density through a cohesive, multi-scale design: it first optimizes local discriminability via block-wise isometries, then stabilizes long-context dynamics using variance-aware modulation to diversify activations, and finally integrates dispersed subspaces via cross-head reflections to induce global covariance mixing. MirrorLA achieves state-of-the-art performance across standard benchmarks, demonstrating that strictly linear efficiency can be achieved without compromising representational fidelity.
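The contrast between passive truncation and active reorientation can be illustrated with a small NumPy sketch. This is not the MirrorLA implementation: the reflection here is chosen by hand (it maps the mean feature onto the diagonal of the non-negative orthant) rather than learned, and the names `householder`, `mean_to_orthant_reflection`, and the toy matrix `X` are hypothetical. The sketch only shows the two properties the abstract relies on: a Householder reflection is an isometry, so it preserves all pairwise inner products, and a well-chosen reflection can move features into the non-negative orthant so that a subsequent ReLU discards little or nothing.

```python
import numpy as np

def householder(v):
    """Return the Householder matrix H = I - 2 v v^T / (v^T v)."""
    v = np.asarray(v, dtype=float)
    return np.eye(v.size) - 2.0 * np.outer(v, v) / np.dot(v, v)

def mean_to_orthant_reflection(X):
    """Hand-picked reflection (an assumption, not learned as in the paper).

    Maps the mean row of X onto the direction (1, ..., 1) / sqrt(d),
    which lies inside the non-negative orthant.
    """
    m = X.mean(axis=0)
    u = np.ones_like(m) / np.sqrt(m.size)
    v = m - np.linalg.norm(m) * u
    if np.allclose(v, 0.0):
        return np.eye(m.size)  # mean already on the diagonal
    return householder(v)

# Toy token features (rows) with substantial mass in negative coordinates.
X = np.array([
    [1.5, -2.0, 0.5, -1.0],
    [1.2, -1.8, 0.7, -1.1],
    [1.6, -2.2, 0.4, -0.8],
])

# Passive truncation: ReLU zeroes every negative coordinate.
relu_X = np.maximum(X, 0.0)

# Active reorientation: reflect first (H is symmetric), then apply ReLU.
H = mean_to_orthant_reflection(X)
reflected_X = X @ H
mirrored_X = np.maximum(reflected_X, 0.0)

energy = lambda A: float(np.sum(A ** 2))
print("energy kept by ReLU alone:     ", energy(relu_X) / energy(X))
print("energy kept by reflect + ReLU: ", energy(mirrored_X) / energy(X))
print("Gram matrix preserved by H:    ", np.allclose(reflected_X @ reflected_X.T, X @ X.T))
```

In this toy case the reflected features are already non-negative, so the ReLU after the reflection is a no-op and the token-to-token similarities are unchanged, whereas ReLU alone removes most of the feature energy. With real data a single fixed reflection would not generally achieve this, which is presumably why the paper makes the reflections learnable and applies them block-wise and across heads.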
