Linear Attention for Efficient Bidirectional Sequence Modeling

Arshia Afzal, Elias Abad Rocamora, Leyla Naz Candogan, Pol Puigdemont, Francesco Tonin, Yongtao Wu, Mahsa Shoaran, Volkan Cevher

TL;DR

This paper addresses the challenge of applying Linear Transformers to bidirectional sequence modeling by introducing LION, a framework that unifies full Linear Attention, bidirectional RNN, and chunkwise parallel forms under a single theory. It shows how existing Linear Transformer variants (e.g., Vanilla Linear Transformer, RetNet, Mamba-like models) can be mapped into LION with either diagonal or selective decay masks, and presents three instantiations, Lion-lit, Lion-d, and Lion-s, covering the no-decay, fixed-decay, and selective-decay regimes. In experiments on ImageNet, C4 MLM, and Long Range Arena, LION variants achieve competitive or superior performance relative to softmax Transformers and prior SSMs, with significantly faster training and more memory-efficient inference, especially in the RNN and chunkwise forms. The framework allows trade-offs between training speed and inference efficiency, and the paper reports mappings and ablations across vision and language tasks.

Abstract

Linear Transformers and State Space Models have emerged as efficient alternatives to softmax Transformers for causal sequence modeling, enabling parallel training via matrix multiplication and efficient RNN-style inference. However, despite their success in causal tasks, no unified framework exists for applying Linear Transformers to bidirectional sequence modeling. We introduce LION, the first framework to systematically extend Linear Transformers to the bidirectional setting. LION generalizes three core representations commonly used in the causal case (full Linear Attention, bidirectional RNN, and chunkwise parallel form) to the bidirectional setting. These forms are theoretically equivalent and enable models to exploit the strengths of each during training and inference. We prove that a broad class of Linear Transformers can be extended using LION and validate our framework via three core examples based on the choice of decay type: LION-LIT, the bidirectional extension of arXiv:2006.16236; LION-D, based on arXiv:2307.08621; and LION-S, a variant using selective decay (arXiv:2103.02143, arXiv:2312.0075). Across standard bidirectional tasks, LION enables models to match or exceed the performance of softmax Transformers, while offering significantly faster training and more efficient inference than existing State Space Models.
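
To make the three decay types concrete, the following is a minimal sketch of the full (parallel) attention form with a decay mask. It rests on assumptions not stated on this page: the mask entry $M_{ij}$ is taken to be the product of per-token decay factors between positions $i$ and $j$ (the paper's exact indexing may differ), the scores are normalized by their row sum, and the names `decay_mask` and `masked_linear_attention` are hypothetical.

```python
import math

def decay_mask(lams):
    """Symmetric decay mask (assumed form): M[i][j] is the product of lams[k]
    for min(i, j) < k <= max(i, j), and 1 on the diagonal."""
    n = len(lams)
    mask = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            mask[i][j] = mask[j][i] = math.prod(lams[j + 1 : i + 1])
    return mask

def masked_linear_attention(Q, K, V, mask):
    """Full form: y_i = sum_j M_ij (q_i . k_j) v_j / sum_j M_ij (q_i . k_j)."""
    out = []
    for i, q in enumerate(Q):
        # Masked, unnormalized scores of token i against every token j.
        w = [mask[i][j] * sum(a * b for a, b in zip(q, k))
             for j, k in enumerate(K)]
        total = sum(w)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) / total
                    for c in range(len(V[0]))])
    return out

n = 4
no_decay = decay_mask([1.0] * n)               # all ones: no decay
fixed = decay_mask([0.5] * n)                  # M[i][j] = 0.5 ** abs(i - j)
selective = decay_mask([0.9, 0.5, 0.8, 0.7])   # stand-ins for input-dependent factors

# Toy positive features, so the row sums used for normalization are nonzero.
Q = [[1.0, 2.0], [0.5, 1.0], [2.0, 0.5], [1.5, 1.5]]
K = [[0.5, 1.0], [1.0, 0.2], [0.3, 0.7], [2.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
Y = masked_linear_attention(Q, K, V, fixed)
```

In this sketch the only difference between the three regimes is how the decay factors are chosen: all ones, one shared constant, or one value per token. In a selective model those per-token values would be computed from the input rather than hard-coded.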

Paper Structure

This paper contains 48 sections, 4 theorems, 134 equations, 15 figures, 17 tables.

Key Result

Theorem 3.1

(Lion-RNN) Since the full-attention form (eq:fullatt) is the parallel form of the recurrence presented in eq:lrm1, full attention can equivalently be expressed as a bidirectional recurrence. In that recurrence, the terms $\frac{1}{2}\mathbf{q}_i^{\top} \mathbf{k}_i \mathbf{v}_i$ and $\frac{1}{2}\mathbf{q}_i^{\top} \mathbf{k}_i$ are subtracted to avoid double counting, because token $i$'s own contribution appears in both the forward and the backward pass.
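
The double-counting argument can be checked numerically. The sketch below covers only the no-decay case and is not the paper's implementation: it assumes positive features and row-sum normalization, and the function names are hypothetical. It runs one forward and one backward recurrence, removes half of the diagonal term from each, and compares the result with unmasked full linear attention.

```python
def full_linear_attention(Q, K, V):
    """Parallel form without a mask: y_i = sum_j (q_i . k_j) v_j / sum_j (q_i . k_j)."""
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) for k in K]
        denom = sum(scores)
        out.append([sum(s * v[c] for s, v in zip(scores, V)) / denom
                    for c in range(len(V[0]))])
    return out

def directional_pass(Q, K, V):
    """One causal recurrence. Returns per-token numerators and denominators
    with half of the diagonal (j == i) contribution subtracted."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # running sum of k_j v_j^T
    z = [0.0] * d                       # running sum of k_j
    nums, dens = [], []
    for q, k, v in zip(Q, K, V):
        for a in range(d):
            z[a] += k[a]
            for c in range(dv):
                S[a][c] += k[a] * v[c]
        qk = sum(a * b for a, b in zip(q, k))
        nums.append([sum(q[a] * S[a][c] for a in range(d)) - 0.5 * qk * v[c]
                     for c in range(dv)])
        dens.append(sum(q[a] * z[a] for a in range(d)) - 0.5 * qk)
    return nums, dens

def bidirectional_rnn(Q, K, V):
    """Combine a forward pass and a backward pass (run on the reversed sequence)."""
    fn, fd = directional_pass(Q, K, V)
    bn, bd = directional_pass(Q[::-1], K[::-1], V[::-1])
    bn, bd = bn[::-1], bd[::-1]  # realign backward results with token order
    return [[(f + b) / (x + y) for f, b in zip(f_row, b_row)]
            for f_row, b_row, x, y in zip(fn, bn, fd, bd)]

# Toy positive features, so every denominator is nonzero.
Q = [[1.0, 2.0], [0.5, 1.0], [2.0, 0.5]]
K = [[0.5, 1.0], [1.0, 0.2], [0.3, 0.7]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
full = full_linear_attention(Q, K, V)
rnn = bidirectional_rnn(Q, K, V)
max_gap = max(abs(a - b) for fr, rr in zip(full, rnn) for a, b in zip(fr, rr))
```

Each pass keeps only a fixed-size state (`S` and `z`) regardless of sequence length, which is the property that makes the RNN form attractive for memory-efficient inference.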

Figures (15)

  • Figure 1: Memory-speed tradeoffs for Lion in three representations on the ImageNet classification task: full attention, RNN, and chunkwise form, for (Left) Lion-lit, (Middle) Lion-d, and (Right) Lion-s.
  • Figure 2: (Left) Standard Transformer block. (Middle) Training mode of Lion with full linear attention. (Right) Inference mode of Lion in the equivalent bidirectional RNN.
  • Figure 3: C4 MLM and GLUE results for the LARGE scale ($334$M). Best and second-best results are in bold and underlined, respectively.
  • Figure 4: Effect of chunk size on GPU memory and speed of Lion-chunk for Lion-d.
  • Figure 5: Causal language modeling results at the GPT-2 128M size. (a) Perplexity on the OpenWebText dataset. (b) Perplexity vs. sequence length on OpenWebText. Lion-s improves over the LinAtt baseline (trans_rnn) while obtaining similar performance to the GPT baseline and extrapolating to context lengths larger than the one used during training.
  • ...and 10 more figures

Theorems & Definitions (6)

  • Remark
  • Theorem 3.1
  • Theorem 3.2
  • Remark
  • Theorem B.1
  • Theorem B.2