The End of Transformers? On Challenging Attention and the Rise of Sub-Quadratic Architectures
Alexander M. Fichtl, Jeremias Bohn, Josefin Kelber, Edoardo Mosca, Georg Groh
TL;DR
This survey examines the computational bottleneck of quadratic attention in transformers and the sub-quadratic alternatives proposed to address it: non-recurrent attention variants, linear RNNs, state space models (SSMs), and hybrids. Its main contributions are a taxonomy of these approaches, a synthesis of their time/space complexity and benchmark results, and an analysis of fundamental architectural limitations that puts the practical trade-offs for long-context NLP in context, including where pure attention may still dominate and where hybrids and memory-centric designs do better. The findings suggest that sub-quadratic methods improve efficiency and enable long contexts in edge/mid regimes but do not yet supersede full attention on frontier-scale tasks; the pragmatic path forward lies in diversified architectures and memory-driven hybrids rather than wholesale transformer replacement.
Abstract
Transformers have dominated sequence processing tasks, most notably language modeling, for the past seven years. However, the inherent quadratic complexity of their attention mechanism remains a significant bottleneck as context length increases. This paper surveys recent efforts to overcome this bottleneck, including advances in (sub-quadratic) attention variants, recurrent neural networks, state space models, and hybrid architectures. We critically analyze these approaches in terms of compute and memory complexity, benchmark results, and fundamental limitations to assess whether the dominance of pure-attention transformers may soon be challenged.
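To make the quadratic bottleneck concrete, the following minimal NumPy sketch contrasts standard softmax attention, which materializes an n x n score matrix, with a kernelized linear attention that compresses keys and values into a fixed-size state. It is an illustration only and is not taken from the paper: the function names, the single-head non-causal setup, and the elu(x) + 1 feature map are assumptions chosen for brevity, and linear attention is only one of the sub-quadratic families the survey covers.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard single-head, non-causal attention over n tokens."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # shape (n, n): quadratic in n
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # each row sums to 1
    return weights @ V                             # shape (n, d_v)

def feature_map(X):
    """Positive feature map elu(x) + 1 (an illustrative assumption)."""
    return np.where(X > 0, X + 1.0, np.exp(np.minimum(X, 0.0)))

def linear_attention(Q, K, V):
    """Kernelized attention that never forms the n x n matrix."""
    Qf, Kf = feature_map(Q), feature_map(K)
    state = Kf.T @ V               # shape (d, d_v): size independent of n
    normalizer = Kf.sum(axis=0)    # shape (d,)
    return (Qf @ state) / (Qf @ normalizer)[:, None]

# Hypothetical sizes: n tokens, head dimension d.
rng = np.random.default_rng(0)
n, d = 512, 64
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

out_quadratic = softmax_attention(Q, K, V)
out_linear = linear_attention(Q, K, V)
```

For n tokens and head dimension d, the first function costs O(n^2 d) time and O(n^2) memory for the score matrix, while the second costs O(n d^2) time and keeps only a d x d state. The two functions compute different outputs, since the kernelized form replaces the softmax similarity with a different one rather than approximating it exactly. This is the kind of efficiency-versus-expressiveness trade-off the survey analyzes.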
