Don't Pay Attention
Mohammad Hammoud, Devang Acharya
TL;DR
The paper tackles two Transformer bottlenecks, the fixed context window and the quadratic cost of self-attention, by introducing Avey, an architecture that pairs a ranker with a neural processor and decouples sequence length from context width, enabling efficient processing of arbitrarily long sequences. It demonstrates that Avey can match or exceed Transformer performance on short-range tasks and markedly outperform it on long-range dependency benchmarks, thanks to a weighted-selective-split mechanism and dynamic contextualization across splits. Through extensive ablations and a cascaded design search, the authors show the importance of embedding expansion, partial-embedding bypassing, and ranker-guided contextualization, and report strong extrapolation capabilities on S-NIAH benchmarks. The work suggests a scalable route for language modeling with practical latency benefits and provides open-source code and pretrained checkpoints for reproducibility and further research.
Abstract
The Transformer has become the de facto standard for modern language models owing to its parallelizable training and effective autoregressive decoding. However, its fixed context window and the quadratic time and memory costs of its self-attention mechanism remain central bottlenecks. These constraints have revived interest in recurrent architectures that scale linearly with sequence length, but at the cost of reduced parallelism. In this paper, we introduce Avey, a new foundational architecture that breaks away from both attention and recurrence. Avey pairs a ranker with an autoregressive neural processor to select and contextualize only the most relevant tokens for any given token. Specifically, it decouples sequence length from context width, thus enabling effective and efficient processing of arbitrarily long sequences. Results show that Avey compares favorably to the Transformer across a variety of standard short-range NLP benchmarks, while significantly outperforming it on tasks requiring long-range dependency modeling.
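To make the idea of decoupling sequence length from context width more concrete, here is a minimal sketch of ranker-style selection. It is an illustration under stated assumptions, not the authors' implementation: the function names (`maxsim`, `select_context`), the parameters `split_size` and `top_k`, and the scoring rule (a MaxSim-style sum of maximum cosine similarities) are all hypothetical choices made for this example, and the weighting of selected splits mentioned in the TL;DR is omitted.

```python
import numpy as np


def maxsim(current, candidate):
    """Assumed relevance score: for each token in `current`, take its highest
    cosine similarity to any token in `candidate`, then sum over tokens."""
    a = current / np.linalg.norm(current, axis=1, keepdims=True)
    b = candidate / np.linalg.norm(candidate, axis=1, keepdims=True)
    return float((a @ b.T).max(axis=1).sum())


def select_context(embeddings, split_size, top_k):
    """Build the context for the last split of a sequence.

    embeddings: (n_tokens, dim) array of token embeddings.
    Returns (context, chosen): `context` stacks the top_k most relevant
    earlier splits (in their original order) followed by the current split;
    `chosen` holds the indices of the selected earlier splits.
    """
    n_tokens = embeddings.shape[0]
    splits = [embeddings[i:i + split_size] for i in range(0, n_tokens, split_size)]
    current, previous = splits[-1], splits[:-1]

    # Score every earlier split against the current one.
    scores = np.array([maxsim(current, p) for p in previous])

    # Keep the top_k highest-scoring splits, restored to sequence order.
    chosen = np.sort(np.argsort(scores)[::-1][:top_k])

    context = np.concatenate([previous[i] for i in chosen] + [current], axis=0)
    return context, chosen
```

In this sketch the context handed to the downstream processor has at most `(top_k + 1) * split_size` tokens, whether the input holds 64 tokens or 640, which is the sense in which context width no longer depends on sequence length.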
