Don't Pay Attention

Mohammad Hammoud, Devang Acharya

TL;DR

The paper tackles the Transformer bottleneck of fixed context and quadratic self-attention by introducing Avey, a ranker plus neural processor architecture that decouples sequence length from context width, enabling efficient processing of arbitrarily long sequences. It demonstrates that Avey can match or exceed Transformer performance on short-range tasks and markedly outperform it on long-range dependency benchmarks, thanks to a weighted-selective-split mechanism and dynamic contextualization across splits. Through extensive ablations and a cascaded design-search, the authors show the importance of embedding expansion, partial-embedding bypassing, and ranker-guided contextualization, while revealing strong extrapolation capabilities on S-NIAH benchmarks. The work suggests a scalable route for language modeling with practical latency benefits and provides open-source code and pretrained checkpoints for reproducibility and further research.

Abstract

The Transformer has become the de facto standard for modern language models owing to its parallelizable training and effective autoregressive decoding. However, its fixed context window and the quadratic time and memory costs of its self-attention mechanism remain central bottlenecks. These constraints have revived interest in recurrent architectures that scale linearly with sequence length, but at the cost of reduced parallelism. In this paper, we introduce Avey, a new foundational architecture that breaks away from both attention and recurrence. Avey pairs a ranker with an autoregressive neural processor to select and contextualize only the most relevant tokens for any given token. Specifically, it decouples sequence length from context width, thus enabling effective and efficient processing of arbitrarily long sequences. Results show that Avey compares favorably to the Transformer across a variety of standard short-range NLP benchmarks, while significantly outperforming it on tasks requiring long-range dependency modeling.

Paper Structure

This paper contains 35 sections, 9 equations, 8 figures, 15 tables.

Figures (8)

  • Figure 1: Needle-in-a-Haystack test performance comparison between Transformer++, Mamba, RWKV-7, and Avey, all using 1.5B parameters. The x-axis denotes the lengths of haystacks (i.e., documents with distractor texts, varying from 2k to 64k tokens) and the y-axis refers to the position of the needle (i.e., a short sentence) within each of the haystacks. A green cell indicates successful needle recall, while a red cell indicates failure. Transformer++, Mamba, and RWKV-7 were trained with 2k-token context windows, whereas Avey was trained with only a 512-token window yet was able to extrapolate to the longest sequences evaluated.
  • Figure 2: The ranker (left) partitions each input sequence into equal-sized splits and identifies the top $k$ most relevant ones (e.g., splits 1 and 3 for $k=2$) with respect to the current split (e.g., split 4), using the MaxSim operator. These top-$k$ splits are then weighted by their normalized scores, where the normalized score (NS) of a split is computed as the ratio of its MaxSim value to the highest MaxSim score among the $k$ splits. Finally, the weighted top-$k$ splits are contextualized together with the current split by the neural processor (right).
  • Figure 3: The neural processor (top) with its three major components: the enricher, the contextualizer (Cx), and the fuser. The processor is unfolded into two copies for illustrative purposes only, to show how different embeddings (e.g., $e_1$ and $e_2$, or more precisely, parts of their tails, $e_{122}$ and $e_{222}$) are contextualized by Cx. In reality, all components are shared across all embeddings, and many embeddings can be input to Cx simultaneously.
  • Figure 4: The Time to First Token (TTFT) for Avey, Transformer++, Mamba, and RWKV-7 across varying sequence lengths.
  • Figure 5: Performance comparison between Transformer++, Mamba, RWKV-7, and Avey on S-NIAH-1 and S-NIAH-2. The x-axis denotes the lengths of haystacks (i.e., documents with distractor texts, varying from 2k to 64k tokens). All models use 0.5B parameters. Similar results are shown in Appendix \ref{sec:extra_long_range_results} for other model sizes.
  • ...and 3 more figures
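
The ranking step described in the Figure 2 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that MaxSim is computed in the ColBERT style (for each token embedding in the current split, take its highest cosine similarity against the candidate split, then sum over tokens), and that weighting a split means scaling its embeddings by the normalized score. The names `cosine`, `maxsim`, `rank_splits`, and `weight_split` are hypothetical and chosen only for this illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Split = Sequence[Vector]  # one embedding vector per token in the split


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def maxsim(current: Split, candidate: Split) -> float:
    """Assumed ColBERT-style MaxSim: for each token of the current split,
    take its best match in the candidate split, then sum over tokens."""
    return sum(max(cosine(q, d) for d in candidate) for q in current)


def rank_splits(current: Split, previous: List[Split], k: int) -> List[Tuple[int, float]]:
    """Return (split index, normalized score) for the top-k previous splits.

    The normalized score (NS) is the split's MaxSim divided by the highest
    MaxSim among the selected k splits, as stated in the Figure 2 caption.
    This sketch assumes the highest score is positive.
    """
    scores = [(i, maxsim(current, s)) for i, s in enumerate(previous)]
    top = sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]
    best = top[0][1] if top else 0.0
    return [(i, score / best if best else 0.0) for i, score in top]


def weight_split(split: Split, ns: float) -> List[List[float]]:
    """Assumed weighting: scale every embedding in the split by its NS."""
    return [[ns * x for x in token] for token in split]


if __name__ == "__main__":
    # Toy 2-dimensional embeddings: one token per split.
    current_split = [[1.0, 0.0]]
    previous_splits = [[[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 1.0]]]
    for index, ns in rank_splits(current_split, previous_splits, k=2):
        print(index, ns, weight_split(previous_splits[index], ns))
```

In this sketch the best-matching split always receives a weight of 1, and the other selected splits are scaled down in proportion to their relevance before being passed, together with the current split, to the neural processor.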