Expansion Span: Combining Fading Memory and Retrieval in Hybrid State Space Models

Elvis Nunez, Luca Zancato, Benjamin Bowman, Aditya Golatkar, Wei Xia, Stefano Soatto

TL;DR

The paper addresses the limited recall in Hybrid State Space Models by introducing Span-Expanded Attention (SE-Attn), which reserves an expansion span for retrieval of relevant tokens from the distant past. SE-Attn integrates memory retrieval directly into the attention mechanism and is trained with HyLoRA, a LoRA-based fine-tuning approach tailored for Hybrid models that also tunes 1D convolution layers. Empirical results on Mamba-2-Hybrid, Llama1, and Zamba2-Hybrid show SE-Attn + HyLoRA extends effective context up to $8\times$ the pre-training size and often matches or surpasses Full-Attn on long-context tasks like RULER and PG-19, while outperforming efficient attention baselines. The work demonstrates a scalable, cost-efficient path to long-context understanding, revealing that perplexity alone is not a reliable long-context proxy and highlighting the practical impact for tasks requiring long-range memory.

Abstract

The "state" of State Space Models (SSMs) represents their memory, which fades exponentially over an unbounded span. By contrast, Attention-based models have "eidetic" (i.e., verbatim, or photographic) memory over a finite span (context size). Hybrid architectures combine State Space layers with Attention, but still cannot recall the distant past and can access only the most recent tokens eidetically. Unlike current methods of combining SSM and Attention layers, we allow the state to be allocated based on relevancy rather than recency. In this way, for every new set of query tokens, our models can "eidetically" access tokens from beyond the Attention span of current Hybrid SSMs without requiring extra hardware resources. We introduce a method to expand the memory span of the hybrid state by "reserving" a fraction of the Attention context for tokens retrieved from arbitrarily distant in the past, thus expanding the eidetic memory span of the overall state. We call this reserved fraction of tokens the "expansion span," and the mechanism to retrieve and aggregate it "Span-Expanded Attention" (SE-Attn). To adapt Hybrid models to using SE-Attn, we propose a novel fine-tuning method that extends LoRA to Hybrid models (HyLoRA) and allows efficient adaptation on long spans of tokens. We show that SE-Attn enables us to efficiently adapt pre-trained Hybrid models on sequences of tokens up to 8 times longer than the ones used for pre-training. We show that HyLoRA with SE-Attn is cheaper and more performant than alternatives like LongLoRA when applied to Hybrid models on natural language benchmarks with long-range dependencies, such as PG-19, RULER, and other common natural language downstream tasks.
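
The contrast the abstract draws between fading and eidetic memory can be made concrete with a toy calculation. The sketch below is illustrative only and is not taken from the paper: it assumes a scalar linear recurrence with a constant decay (real SSMs such as Mamba-2 use vector states and input-dependent decay), and the function names `ssm_influence` and `window_influence` are hypothetical.

```python
def ssm_influence(decay: float, age: int) -> float:
    """Weight that the toy recurrence h_t = decay * h_{t-1} + x_t assigns
    to an input seen `age` steps ago: it shrinks exponentially with age
    but never reaches exactly zero (fading memory, unbounded span)."""
    return decay ** age


def window_influence(window: int, age: int) -> float:
    """Access that an attention layer with a finite span of `window` tokens
    has to a token `age` steps back: verbatim (1.0) inside the span,
    none (0.0) outside it (eidetic memory, finite span)."""
    return 1.0 if age < window else 0.0


if __name__ == "__main__":
    # Assumed values for illustration: decay 0.99, attention span 128.
    for age in (0, 10, 100, 1000):
        print(age, ssm_influence(0.99, age), window_influence(128, age))
```

Under these assumptions, a token 1000 steps back is still weakly present in the recurrent state but is entirely invisible to the 128-token window. This is the gap that retrieval by relevance rather than recency is meant to close.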

Paper Structure

This paper contains 31 sections, 3 equations, 13 figures, 10 tables.

Figures (13)

  • Figure 1: Span-Expanded Attention (SE-Attn) overview. SE-Attn is a Sparse Attention mechanism used to expand the memory span of Hybrid SSMs. Left: SE-Attn works by reserving a fraction of the Attention context for tokens retrieved arbitrarily far back in the past. We call this reserve the "expansion span," and we populate it with blocks of previous tokens ("memory blocks"). When new tokens arrive, a similarity-based search compares the queries with past memory blocks—represented as summary tokens—to retrieve relevant memory blocks. Then, these retrieved memory blocks are jointly processed with the queries via Attention. While the final Attention mechanism always processes a fixed number of tokens, it can have a longer span since it retrieves tokens from arbitrarily far back in the past. Right: Retrieving tokens from the past yields a sparse Attention pattern.
  • Figure 2: Fine-tuning with SE-Attn outperforms SW-Attn and $S^2$-Attn on the RULER benchmark. HyLoRA outperforms LoRA and LoRA+ on Hybrid models. (a): We fine-tune Mamba-2-Hybrid with a context size of 8192 and evaluate on eleven RULER tasks, as explained in the paper's section on RULER aggregation. Fine-tuning with SE-Attn consistently outperforms SW-Attn and $S^2$-Attn even when evaluating on context sizes beyond the fine-tuning size. (b): We fine-tune Mamba-2-Hybrid with SE-Attn using LoRA, LoRA+, and HyLoRA. LoRA and LoRA+ perform sub-optimally. Our HyLoRA additionally trains the 1D convolution layers and yields strong performance.
  • Figure 3: SE-Attn ablations on Mamba-2-Hybrid. (a): Attention-based memory retrieval (SE-Attn) improves upon no retrieval and random retrieval. (b): Using SE-Attn with a chunk size chosen randomly from $\{2048, 4096\}$ acts as a regularizer and outperforms SE-Attn with fixed chunk sizes of 2048 and 4096. (c): SE-Attn with larger memory blocks (i.e., more tokens per block) with a smaller top-$k$ tends to do better than smaller blocks with a larger top-$k$. (d): An expansion span consisting of 256 total tokens (8 memory blocks with 32 tokens in each) gives the strongest performance.
  • Figure 4: SE-Attn offers a better runtime-memory trade-off than other Attention variants. We profile various attention layers during a single training step. SW-Attn uses a window size of 4096 and SE-Attn alternates between a chunk size of 2048 and 4096. (a): Runtime of a single training step. (b): Peak GPU memory used during the training step. (c): Runtime vs. memory used. Points on the lower left of the plot exhibit a stronger runtime-memory trade-off.
  • Figure 5: HyLoRA outperforms LoRA and LoRA+ on Hybrid models. We fine-tune Mamba-2-Hybrid with Full-Attn using LoRA, LoRA+, and HyLoRA. We find that LoRA and LoRA+ perform sub-optimally compared to HyLoRA which also trains 1D convolution layers.
  • ...and 8 more figures
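
The retrieval pipeline described in the Figure 1 caption (summarize past memory blocks, score them against the incoming queries, keep the top-$k$, then attend jointly over the retrieved blocks and the current chunk) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: summaries are mean-pooled keys and the chunk is scored with one mean-pooled query (the caption does not say how summary tokens are computed), there is a single head, and causal masking inside the chunk is omitted. All function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve_blocks(queries, past_keys, block_size, top_k):
    """Indices of the top_k memory blocks most similar to the query chunk."""
    usable = len(past_keys) - len(past_keys) % block_size  # drop ragged tail
    blocks = [past_keys[i:i + block_size] for i in range(0, usable, block_size)]
    summaries = [mean_vector(b) for b in blocks]  # assumption: mean-pooled summary
    pooled_query = mean_vector(queries)           # assumption: one query per chunk
    scores = [dot(pooled_query, s) for s in summaries]
    ranked = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])                 # restore temporal order


def se_attention(queries, chunk_keys, chunk_values,
                 past_keys, past_values, block_size, top_k):
    """Attend over [retrieved memory blocks] + [current chunk] only."""
    chosen = retrieve_blocks(queries, past_keys, block_size, top_k)
    keys, values = [], []
    for b in chosen:  # the expansion span: at most top_k * block_size tokens
        keys += past_keys[b * block_size:(b + 1) * block_size]
        values += past_values[b * block_size:(b + 1) * block_size]
    keys += chunk_keys
    values += chunk_values
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs, chosen
```

In this sketch the attention step touches at most `top_k * block_size + len(chunk_keys)` tokens regardless of how long `past_keys` grows, which mirrors the caption's point that the mechanism processes a fixed number of tokens while its span can reach arbitrarily far back. With the configuration reported in Figure 3(d), 8 memory blocks of 32 tokens each, the expansion span would hold 256 tokens.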