RAT: Bridging RNN Efficiency and Attention Accuracy via Chunk-based Sequence Modeling

Xiuying Wei, Anunay Yadav, Razvan Pascanu, Caglar Gulcehre

TL;DR

RAT introduces a chunk-based temporal mixing layer that sits between RNNs and full self-attention, dividing sequences into chunks of length $L$ and applying intra-chunk recurrence alongside inter-chunk softmax attention to enable long-range retrieval with reduced computation. By tuning $L$, RAT interpolates between recurrence and attention, and a hybrid variant interleaves RAT with sliding-window attention to leverage strong local interactions. Extensive experiments on 1.3B-parameter models pretrained on 100B tokens show RAT with $L=16$ achieves substantial speedups (up to $\sim$7–10×) while maintaining competitive accuracy across short- and long-context benchmarks, and RAT-SWA often yields state-of-the-art results on long-context tasks. The work also provides thorough analyses of efficiency, ablations, and length-generalization strategies (RoPE/NoPE), highlighting RAT’s potential for scalable, efficient long-context language modeling and signaling avenues for future scaling and optimization.

Abstract

Transformers have become the cornerstone of modern large-scale language models, but their reliance on softmax attention poses a computational bottleneck at both training and inference. Recurrent models offer high efficiency, but compressing the full sequence into a fixed-size and holistic representation can suffer from memory degradation in long contexts and limit fine-grained retrieval. To address this, we propose RAT, an intermediate design that bridges the efficiency of RNNs and the capacity of attention. RAT partitions the input into chunks, applies recurrence within each chunk for local dependencies, and softmax-based attention across chunks for long-range interactions. This design mitigates memory degradation and enables direct access to distant tokens, while retaining computational efficiency. Empirically, with a chunk size of 16, the RAT block achieves a 7$\times$ improvement in training speed for 100K sequence length and 9$\times$ in generation at the 4K position, while maintaining similar performance compared to standard attention. We demonstrate this by training 1.3B parameter models from scratch and performing large-scale evaluations, including short- and long-context benchmarks, as well as supervised fine-tuning (SFT). We further propose a hybrid architecture that interleaves RAT with local attention. By combining efficient long-range modeling with strong local interactions, this hybrid design not only improves inference speed and reduces cache memory usage, but also consistently enhances performance and shows the overall best results. Code is available at https://github.com/CLAIRE-Labo/RAT.
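The chunking idea in the abstract can be illustrated with a small NumPy sketch. This is a simplified, single-head illustration under stated assumptions, not the authors' implementation: the function name `rat_sketch`, the `gate` input, the specific gated-recurrence form, and the choice to attend over the final state of each earlier chunk plus the current position's running state are all hypothetical choices made for this example. The released code at the repository above may differ in every one of these details.

```python
import numpy as np

def rat_sketch(q, k, v, gate, chunk_len):
    """Illustrative chunked mixing (hypothetical, not the reference code).

    q, k, v, gate: arrays of shape (T, d); gate values are assumed to lie in [0, 1).
    T must be divisible by chunk_len.
    """
    T, d = q.shape
    assert T % chunk_len == 0, "sequence length must be a multiple of chunk_len"

    # Step 1: gated recurrence over keys and values, reset at every chunk boundary.
    k_state = np.zeros((T, d))
    v_state = np.zeros((T, d))
    for start in range(0, T, chunk_len):
        hk = np.zeros(d)
        hv = np.zeros(d)
        for t in range(start, start + chunk_len):
            hk = gate[t] * hk + (1.0 - gate[t]) * k[t]
            hv = gate[t] * hv + (1.0 - gate[t]) * v[t]
            k_state[t] = hk
            v_state[t] = hv

    # Step 2: causal softmax attention across chunks. Each position sees the
    # final state of every earlier chunk plus its own running state.
    out = np.zeros((T, d))
    for t in range(T):
        c = t // chunk_len
        idx = [(j + 1) * chunk_len - 1 for j in range(c)] + [t]
        scores = k_state[idx] @ q[t] / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[t] = weights @ v_state[idx]
    return out
```

In this sketch, setting `chunk_len` equal to the sequence length leaves a single chunk, so the output is just the recurrent state at each position. Setting `chunk_len=1` with an all-zero `gate` recovers ordinary causal softmax attention. Intermediate values attend over roughly `T / chunk_len` entries per query, which is the source of the reduced computation described above.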

Paper Structure

This paper contains 51 sections, 3 equations, 5 figures, 14 tables.

Figures (5)

  • Figure 1: Latency of the temporal mixing block (including linear projections) with a model dimension of 2048. (a) Full-sequence latency with 200K tokens; (b) generation of 512 tokens at specified positions. The attention baseline uses FlashAttention.
  • Figure 2: (a) Ablation study on RAT(L=64). (b) and (c) show pretraining results on 200M and 1.3B models, respectively. RAT lies between RNN and attention in terms of pretraining perplexity.
  • Figure 3: We measure the maximum throughput of the full 1.3B model for generating 1024 tokens under different prefilling lengths. For each total sequence length $T$, the prefilling length is set to $T - 1024$. For example, $T=4096$ corresponds to a prefilling of 3072 tokens, while $T=8192$ and $T=16384$ correspond to 7168 and 15360 tokens, respectively. When the sequence length increases, the maximum throughput ratio between RAT and attention rises from $10.2\times$ to $15.6\times$, highlighting the strong efficiency advantage of RAT in long-context generation.
  • Figure 4: Evaluation at different test lengths for pretrained models trained with a 4K context window. RAT(L=16)-SWA with NoPE achieves the best overall performance, exhibiting strong generalization up to $T = 16384$ while maintaining low loss within the training context.
  • Figure :