Retrievit: In-context Retrieval Capabilities of Transformers, State Space Models, and Hybrid Architectures

Georgios Pantazopoulos, Malvina Nikandrou, Ioannis Konstas, Alessandro Suglia

TL;DR

Hybrid models outperform SSMs and match or exceed Transformers in data efficiency and extrapolation for information-dense context retrieval; Transformers, however, remain superior on position retrieval tasks.

Abstract

Transformers excel at in-context retrieval but suffer from quadratic complexity with sequence length, while State Space Models (SSMs) offer efficient linear-time processing but have limited retrieval capabilities. We investigate whether hybrid architectures combining Transformers and SSMs can achieve the best of both worlds on two synthetic in-context retrieval tasks. The first task, n-gram retrieval, requires the model to identify and reproduce the n-gram that follows the query within the input sequence. The second task, position retrieval, presents the model with a single query token and requires it to perform a two-hop associative lookup: first locating the corresponding element in the sequence, and then outputting its positional index. Under controlled experimental conditions, we assess data efficiency, length generalization, robustness to out-of-domain training examples, and learned representations across Transformers, SSMs, and hybrid architectures. We find that hybrid models outperform SSMs and match or exceed Transformers in data efficiency and extrapolation for information-dense context retrieval. However, Transformers maintain superiority in position retrieval tasks. Through representation analysis, we discover that SSM-based models develop locality-aware embeddings where tokens representing adjacent positions become neighbors in embedding space, forming interpretable structures. This emergent property, absent in Transformers, explains both the strengths and limitations of SSMs and hybrids for different retrieval tasks. Our findings provide principled guidance for architecture selection based on task requirements and reveal fundamental differences in how Transformers, SSMs, and hybrid models learn positional associations.
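
To make the two task definitions concrete, the following is a minimal sketch of how one example of each task might be generated. It is not the authors' data pipeline: the function names, the default sizes, the use of unique tokens (so the query occurs exactly once), the 0-based position index, and the encoding of position tokens as `vocab_size + index` are all assumptions made for illustration.

```python
import random


def make_ngram_example(seq_len=12, n=2, k=2, vocab_size=50, seed=0):
    """Return (sequence, query, target) for one n-gram retrieval example.

    Hypothetical helper: the query is an n-gram taken from the sequence and
    the target is the k tokens that immediately follow it.
    """
    rng = random.Random(seed)
    # Assumption: tokens are sampled without replacement, so the query
    # n-gram appears exactly once in the sequence.
    seq = rng.sample(range(vocab_size), seq_len)
    # Pick a start index that leaves room for the n-gram and its k successors.
    start = rng.randrange(0, seq_len - n - k + 1)
    query = seq[start:start + n]
    target = seq[start + n:start + n + k]
    return seq, query, target


def make_position_example(seq_len=12, vocab_size=50, seed=0):
    """Return (sequence, query_token, target_position_token).

    Hypothetical helper: the target is a dedicated position token, kept
    distinct from regular input tokens by offsetting it with vocab_size.
    """
    rng = random.Random(seed)
    seq = rng.sample(range(vocab_size), seq_len)
    pos = rng.randrange(seq_len)  # assumption: 0-based position index
    query = seq[pos]
    target = vocab_size + pos     # position token id, outside the input range
    return seq, query, target
```

Calling `make_ngram_example()` yields a sequence together with a 2-token query and the 2 tokens that follow it, while `make_position_example()` yields a single query token and the position token the model would be expected to emit.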

Paper Structure (47 sections, 1 equation, 22 figures, 2 tables)

Figures (22)

  • Figure 1: Overview of the synthetic in-context retrieval tasks. Left: For the task of n-gram retrieval, the model accepts a sequence containing a query n-gram (e.g., $n=2$) and must produce the $k$ tokens following it in the sequence (e.g., $k=2$). Right: For the task of position retrieval, the model accepts a sequence with a single query token and must output the positional index of that token in the sequence (here, 3). Position indices are represented as dedicated vocabulary tokens distinct from regular input tokens.
  • Figure 2: Illustration of hybrid architectures. Left: Interleaved Mamba and Transformer blocks. A Transformer block is inserted every $N$ Mamba blocks. Right: Two-stream block in which the outputs of both streams are fused by a learnable gating mechanism.
  • Figure 3: (a) N-gram retrieval data efficiency. We train models to retrieve a sequence of $k=3$ tokens that follow a randomly selected n-gram ($n=2$) in a string of length $\leq 100$, and evaluate on strings of length 100. While Transformers train significantly faster than SSMs, hybrid architectures converge even faster. (b) Effect of state space dimension. We train SSM models with different state space dimensions. Although both Mamba versions benefit from a higher-capacity state space, their performance is still inferior to that of a standard Transformer, shown here as a violet dashed line. (c) Effect of interleaved SSM/Transformer blocks in hybrid-interleaved models (Hybrid$_I$). We explore the effect of inserting a Transformer block after every $N \in \{1,2,3,4\}$ SSM blocks. A single Transformer layer complements the SSM stack by correcting the hidden state, yielding higher performance than a pure SSM model, shown in green. Models that interleave a Transformer block after $N<4$ SSM blocks even surpass the performance of a pure Transformer.
  • Figure 4: (a) Illustration of the suffix (top) and prefix (bottom) n-gram retrieval variants. In the suffix version the query is given at the end of the input sequence, while in the prefix version the query is provided at the beginning. (b) When trained on sequences of length $\le 100$, Mamba2 generalizes better than a Transformer with RoPE embeddings, while hybrid models show near-perfect generalization. (c) On the prefix variant, Mamba2 performs even better given the lower memory requirements of the task, while Transformers exhibit a slight performance boost.
  • Figure 5: Error rates with non-unique n-gram suffix queries for one model from each family: (a) Transformer, (b) SSM: Mamba2, (c) Hybrid$_I$ with interleaved Mamba2 and Transformer blocks. Mamba2 fails to match the query n-gram to any candidate within the sequence, while the Transformer and the hybrid model maintain a low miss rate even for sequences with many duplicates.
  • ...and 17 more figures
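
The two hybrid designs described in the Figure 2 caption can be sketched in a few lines. This is an illustrative sketch only, not the paper's implementation: `interleaved_layout` and `gated_fusion` are hypothetical names, the block labels are placeholders rather than real layers, and the sigmoid-weighted convex combination is one common form of learnable gate assumed here because the caption does not specify the exact formula.

```python
import math


def interleaved_layout(num_ssm_blocks, every_n):
    """List block types for an interleaved hybrid stack.

    Hypothetical helper: a "transformer" block is placed after every
    `every_n` "mamba" blocks, following the Figure 2 (left) description.
    """
    layout = []
    for i in range(1, num_ssm_blocks + 1):
        layout.append("mamba")
        if i % every_n == 0:
            layout.append("transformer")
    return layout


def gated_fusion(ssm_out, attn_out, gate_logits):
    """Fuse two stream outputs element-wise with a sigmoid gate.

    Assumption: fused = g * ssm + (1 - g) * attn with g = sigmoid(logit).
    In a trained model the logits would come from learned parameters; here
    they are passed in directly.
    """
    fused = []
    for s, a, z in zip(ssm_out, attn_out, gate_logits):
        g = 1.0 / (1.0 + math.exp(-z))
        fused.append(g * s + (1.0 - g) * a)
    return fused
```

For example, `interleaved_layout(4, 2)` places a Transformer block after the second and fourth Mamba blocks, and a gate logit of 0 weights both streams equally.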