Short window attention enables long-term memorization
Loïc Cabannes, Maximilian Beck, Gergely Szilvasy, Matthijs Douze, Maria Lomeli, Jade Copet, Pierre-Emmanuel Mazaré, Gabriel Synnaeve, Hervé Jégou
TL;DR
SWAX investigates hybrid architectures that combine sliding-window softmax attention with a linear recurrent memory (xLSTM) to tackle long-context modeling under fixed compute. The study shows that shorter sliding windows can improve long-range recall, because they force the linear memory components to learn long-term dependencies, and it introduces a stochastic window-size training regime that combines the strengths of short and long windows. Across 1.4B and 7B models and multiple benchmarks, stochastic window training yields better short-context perplexity than fixed-window baselines, with long-context recall that is competitive or better. The findings offer a practical, scalable approach to robust length extrapolation in memory-constrained transformer hybrids, with broader applicability to other linear-attention variants such as Gated DeltaNet.
Abstract
Recent works show that hybrid architectures combining sliding-window softmax attention layers with linear recurrent neural network (RNN) layers outperform either architecture on its own. However, the impact of the window length and the interplay between softmax attention and linear RNN layers remain under-studied. In this work, we introduce SWAX, a hybrid architecture consisting of sliding-window attention and xLSTM linear RNN layers. A counter-intuitive finding with SWAX is that larger sliding windows do not improve long-context performance. In fact, short-window attention encourages the model to better train the long-term memory of the xLSTM, because the model relies less on the softmax attention mechanism for long-context retrieval. The drawback of small sliding windows is that they are detrimental to short-context tasks, which moderately larger windows would otherwise handle well. Therefore, we train SWAX by stochastically changing the sliding window size, forcing the model to leverage both a longer context window and the xLSTM memory. SWAX trained with stochastic window sizes significantly outperforms regular window attention on both short- and long-context problems.
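The idea of stochastically changing the window size can be illustrated with a minimal sketch. The abstract does not specify how window sizes are sampled, so everything below is an assumption made for illustration only: the two-value scheme, the names `sliding_window_mask` and `sample_window`, and the parameters `short_window`, `long_window`, and `p_short` are hypothetical and are not taken from the paper.

```python
import random


def sliding_window_mask(seq_len, window):
    """Causal sliding-window mask as nested lists of booleans.

    Entry [i][j] is True when query position i may attend to key
    position j, i.e. when j <= i and i - j < window.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def sample_window(short_window, long_window, p_short, rng):
    """Pick the attention window for one training step.

    Hypothetical scheme: use the short window with probability p_short,
    otherwise the long one. The actual sampling distribution used for
    SWAX is not stated in the abstract.
    """
    return short_window if rng.random() < p_short else long_window


# Illustrative values only; none of these numbers come from the paper.
rng = random.Random(0)
for step in range(3):
    window = sample_window(short_window=2, long_window=4, p_short=0.5, rng=rng)
    mask = sliding_window_mask(seq_len=6, window=window)
    # The last query position attends to at most `window` keys.
    print(step, window, sum(mask[-1]))
```

In a hybrid model of this kind, the sampled mask would apply only to the sliding-window attention layers. On steps where the short window is drawn, any information older than the window can reach the current position only through the recurrent (xLSTM) layers, which is the pressure toward long-term memorization that the abstract describes.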
