Short window attention enables long-term memorization

Loïc Cabannes, Maximilian Beck, Gergely Szilvasy, Matthijs Douze, Maria Lomeli, Jade Copet, Pierre-Emmanuel Mazaré, Gabriel Synnaeve, Hervé Jégou

TL;DR

SWAX investigates hybrid architectures that combine sliding-window softmax attention with a linear memory (xLSTM) to tackle long-context modeling under fixed compute. The study shows that shorter sliding windows can improve long-range recall, because they force the linear memory components to learn long-term dependencies, and it introduces a stochastic window-size training regime that blends short- and long-context strengths. Across 1.4B and 7B models and multiple benchmarks, stochastic window training yields better short-context perplexity than fixed-window baselines, with long-context recall that matches or exceeds them. The findings offer a practical, scalable approach to robust length extrapolation in memory-constrained transformer hybrids, with broader applicability to other linear-attention variants such as Gated DeltaNet.

Abstract

Recent works show that hybrid architectures combining sliding window softmax attention layers with linear recurrent neural network (RNN) layers outperform either architecture taken separately. However, the impact of the window length and the interplay between softmax attention and linear RNN layers remain under-studied. In this work, we introduce SWAX, a hybrid architecture consisting of sliding-window attention and xLSTM linear RNN layers. A counter-intuitive finding with SWAX is that larger sliding windows do not improve long-context performance. In fact, short window attention encourages the model to better train the long-term memory of the xLSTM, by relying less on the softmax attention mechanism for long-context retrieval. The drawback of small sliding windows is that they are detrimental to short-context tasks, which moderately larger sliding windows would otherwise handle. Therefore, we train SWAX by stochastically changing the sliding window size, forcing the model to leverage both a longer context window and the xLSTM memory. SWAX trained with stochastic window sizes significantly outperforms regular window attention on both short- and long-context problems.
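To make the stochastic window-size idea concrete, here is a minimal Python sketch of how a training loop might pick a window size per step and build the matching causal sliding-window mask. The 128/2048 pair is taken from the Figure 2 caption; the per-step sampling granularity, the sampling probability `p_short = 0.5`, and the helper names `sliding_window_mask` and `sample_window` are assumptions made for illustration, not details confirmed by the paper.

```python
import random


def sliding_window_mask(seq_len, window):
    """Causal sliding-window mask as a list of boolean rows.

    Query position i may attend to key position j only when
    j <= i (causal) and i - j < window (inside the window).
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def sample_window(rng, short=128, long=2048, p_short=0.5):
    """Pick the window size for one training step.

    p_short is an assumed value; the source does not state the
    sampling probability here.
    """
    return short if rng.random() < p_short else long


# Hypothetical usage: draw one window size per training step.
rng = random.Random(0)
windows = [sample_window(rng) for _ in range(1000)]

# Tiny mask for inspection: 6 tokens, window of 3.
mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With a short window, tokens outside the mask are reachable only through the recurrent (xLSTM) state, which is the pressure the abstract describes; steps that sample the long window let the attention layers still learn to use a wider local context.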

Paper Structure

This paper contains 29 sections, 4 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Short (average score across benchmarks) vs long context performance for 1.4B xLSTM, SWA (sliding window attention) and SWAX with different sliding window sizes.
  • Figure 2: RULER Needle-In-A-Haystack accuracy of a 1.4B SWAX model with a fixed sliding window size of 2048 vs our method using a stochastic window size of 128/2048.
  • Figure 3: We compare 4 different types of architectures, including 3 hybrid architectures: (1) The transformer with vanilla self-attention (SA). Its complexity is prohibitive for long context lengths. (2) This is circumvented by replacing some SA layers with sliding window attention (SWA) layers [gemmateam2025gemma3technicalreport, openai2025gptoss120bgptoss20bmodel]. (3) xLSTM [beck2024xlstmextendedlongshortterm] offers a memory with an unbounded time horizon, albeit not as precise as SA for handling the recent context. (4) SWAX is a hybrid architecture that includes both SWA layers and long-term memory layers, implemented with mLSTM memory cells.
  • Figure 4: RULER needle-in-a-haystack average performance on varying sequence lengths for 1.4B models. For SWA and SWAX we indicate the sliding window size after the colon.
  • Figure 5: RULER NIAH subtask accuracy for 1.4B SWAX models with different window sizes.
  • ...and 3 more figures