Linear Attention for Efficient Bidirectional Sequence Modeling
Arshia Afzal, Elias Abad Rocamora, Leyla Naz Candogan, Pol Puigdemont, Francesco Tonin, Yongtao Wu, Mahsa Shoaran, Volkan Cevher
TL;DR
This paper addresses the challenge of applying Linear Transformers to bidirectional sequence modeling by introducing LION, a framework that unifies the full Linear Attention, bidirectional RNN, and chunkwise parallel forms under a single theory. It shows how existing Linear Transformer variants (e.g., the vanilla Linear Transformer, RetNet, and Mamba-like models) can be mapped into LION with either diagonal or selective decay masks, and presents three instantiations, LION-LIT, LION-D, and LION-S, covering the no-decay, fixed-decay, and selective-decay regimes respectively. In experiments on ImageNet, C4 MLM, and Long Range Arena, LION variants achieve competitive or superior performance relative to softmax Transformers and prior SSMs, while delivering significantly faster training and more memory-efficient inference, especially in the RNN form or chunked implementations. The framework allows flexible trade-offs between training speed and inference efficiency, and the paper supports it with practical mappings and ablations across diverse tasks, marking a substantial step toward scalable bidirectional Linear Transformers for both vision and language domains.
Abstract
Linear Transformers and State Space Models have emerged as efficient alternatives to softmax Transformers for causal sequence modeling, enabling parallel training via matrix multiplication and efficient RNN-style inference. However, despite their success in causal tasks, no unified framework exists for applying Linear Transformers to bidirectional sequence modeling. We introduce LION, the first framework to systematically extend Linear Transformers to the bidirectional setting. LION generalizes three core representations commonly used in the causal case (full Linear Attention, bidirectional RNN, and chunkwise parallel form) to the bidirectional setting. These forms are theoretically equivalent and enable models to exploit the strengths of each during training and inference. We prove that a broad class of Linear Transformers can be extended using LION and validate our framework via three core examples based on the choice of decay type: LION-LIT, the bidirectional extension of arXiv:2006.16236; LION-D, based on arXiv:2307.08621; and LION-S, a variant using selective decay (arXiv:2103.02143, arXiv:2312.0075). Across standard bidirectional tasks, LION enables models to match or exceed the performance of softmax Transformers, while offering significantly faster training and more efficient inference than existing State Space Models.
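As a rough illustration of the claimed equivalence between the full attention form and the bidirectional RNN form, the pure-Python sketch below computes an unnormalized, decay-masked linear attention both ways and compares the results. Everything in it is an assumption made for the example rather than the paper's implementation: the function names, the per-token scalar `decay` list, the mask convention (the product of the decay values on the path between two positions), and the omission of any feature map or output scaling. Setting all decay values to 1 loosely corresponds to the no-decay regime, a constant value to fixed decay, and token-dependent values to selective decay.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(Q, K, V, decay):
    """Quadratic form: y_i = sum_j M[i][j] * (q_i . k_j) * v_j.

    Assumed mask: M[i][j] is the product of decay[t] for t in
    (min(i, j), max(i, j)], so M[i][i] = 1 and decay[0] is unused.
    """
    n, d_v = len(Q), len(V[0])
    Y = []
    for i in range(n):
        y = [0.0] * d_v
        for j in range(n):
            m = 1.0
            for t in range(min(i, j) + 1, max(i, j) + 1):
                m *= decay[t]
            score = m * dot(Q[i], K[j])
            for c in range(d_v):
                y[c] += score * V[j][c]
        Y.append(y)
    return Y


def directional_scan(Q, K, V, decay, order):
    """One recurrent pass over `order`, keeping a single d_k x d_v state."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    out = [None] * len(Q)
    prev = None
    for i in order:
        # Decay attached to the step between `prev` and `i` (either direction).
        a = 1.0 if prev is None else decay[max(i, prev)]
        for r in range(d_k):
            for c in range(d_v):
                S[r][c] = a * S[r][c] + K[i][r] * V[i][c]
        out[i] = [sum(Q[i][r] * S[r][c] for r in range(d_k)) for c in range(d_v)]
        prev = i
    return out


def bidirectional_rnn(Q, K, V, decay):
    """Forward pass + backward pass, minus the diagonal term counted twice."""
    n = len(Q)
    fwd = directional_scan(Q, K, V, decay, range(n))
    bwd = directional_scan(Q, K, V, decay, range(n - 1, -1, -1))
    Y = []
    for i in range(n):
        diag = dot(Q[i], K[i])
        Y.append([f + b - diag * v for f, b, v in zip(fwd[i], bwd[i], V[i])])
    return Y


def max_abs_diff(A, B):
    return max(abs(x - y) for ra, rb in zip(A, B) for x, y in zip(ra, rb))


# Toy inputs (hypothetical values): 4 tokens, key dim 2, value dim 3.
Q = [[1.0, 0.5], [0.2, -1.0], [0.3, 0.7], [-0.4, 0.1]]
K = [[0.6, -0.2], [1.1, 0.4], [-0.5, 0.9], [0.3, 0.3]]
V = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0], [0.0, 3.0, 1.0], [2.0, 1.0, -1.0]]

regimes = {
    "no decay": [1.0, 1.0, 1.0, 1.0],
    "fixed decay": [0.9, 0.9, 0.9, 0.9],
    "selective decay": [1.0, 0.8, 0.5, 0.95],
}

for name, decay in regimes.items():
    gap = max_abs_diff(full_attention(Q, K, V, decay),
                       bidirectional_rnn(Q, K, V, decay))
    print(name, "max |full - rnn| =", gap)
```

In this sketch the full form touches every pair of positions, which is what makes it amenable to parallel matrix multiplication during training, while each recurrent pass only carries a single key-by-value state, which is the source of the memory savings at inference. The chunkwise parallel form mentioned in the abstract sits between the two and is not shown here.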
