Taipan: Efficient and Expressive State Space Language Models with Selective Attention

Chien Van Nguyen, Huy Huu Nguyen, Thang M. Pham, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Ryan A. Rossi, Trung Bui, Viet Dac Lai, Franck Dernoncourt, Thien Huu Nguyen

TL;DR

Taipan is a novel hybrid architecture that combines Mamba-2 with Selective Attention Layers (SALs), balancing Mamba's efficiency with Transformer-like performance in memory-intensive tasks.

Abstract

Efficient long-context language modeling remains a significant challenge in Natural Language Processing (NLP). While Transformers dominate language tasks, they struggle with long sequences due to quadratic computational complexity in training and linearly scaling memory costs during inference. Recent State Space Models (SSMs) such as Mamba offer alternatives with constant memory usage, but they underperform in tasks requiring extensive in-context retrieval. We introduce Taipan, a novel hybrid architecture that combines Mamba-2 with Selective Attention Layers (SALs). These SALs identify tokens requiring long-range interactions, remove less important features, and then augment their representations using the attention module. This approach balances Mamba's efficiency with Transformer-like performance in memory-intensive tasks. By constraining the attention budget, Taipan extends accurate predictions to context lengths of up to 1 million tokens while preserving computational efficiency. Our experiments demonstrate Taipan's superior performance across various scales and tasks, offering a promising solution for efficient long-context language modeling.
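To make the memory argument in the abstract concrete, the sketch below computes the size of a standard attention key/value cache, which grows linearly with sequence length, and compares it with a cache that retains only a bounded number of tokens. The model dimensions (`n_layers`, `n_kv_heads`, `head_dim`), the 2-byte value size, and the 2048-token bound are hypothetical values chosen for illustration; they are not taken from the paper, and the bounded cache is a generic stand-in rather than Taipan's actual mechanism.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes held by a standard attention KV cache.

    One key vector and one value vector are stored per token, per head,
    per layer, so the total grows linearly with seq_len.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def bounded_cache_bytes(seq_len, max_tokens, **model_dims):
    """Cache size when at most `max_tokens` tokens are ever retained."""
    return kv_cache_bytes(min(seq_len, max_tokens), **model_dims)


# Hypothetical model dimensions, for illustration only.
cfg = dict(n_layers=24, n_kv_heads=16, head_dim=64)

for n in (2048, 65536, 1_000_000):
    full = kv_cache_bytes(n, **cfg)
    capped = bounded_cache_bytes(n, max_tokens=2048, **cfg)
    print(f"{n:>9} tokens: full={full / 2**20:10.1f} MiB  bounded={capped / 2**20:6.1f} MiB")
```

Under these assumed dimensions the full cache costs 98,304 bytes per token, so doubling the context doubles the memory, while the bounded cache stops growing once the sequence exceeds the retention limit.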

Paper Structure

This paper contains 20 sections, 14 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Model Performance Comparison. a) Perplexity across different context lengths. Lower perplexity indicates better performance. b) Latency comparison of models at various generation lengths. Taipan exhibits significantly lower latency and superior scaling compared to other strong baselines for longer sequences.
  • Figure 2: An overview of the Taipan architecture.
  • Figure 3: Attention mechanisms in Taipan's Selective Attention Layers. White areas indicate no attention. (a) Full Causal Attention; (b) Sliding Window Attention ($w = 4$); (c) Selective Attention ($C = 0.3$, $w = 5$).
  • Figure 4: Performance on in-context retrieval tasks.
  • Figure 5: Effect of Attention Budget Capacity $C$ on Taipan's Performance.
  • ...and 1 more figure
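
The three attention patterns named in the Figure 3 caption can be sketched as boolean masks, where `True` means a query position (row) may attend to a key position (column). This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes that $w$ is a causal window width that includes the current token, that $C$ is the fraction of tokens allowed to use attention, and that tokens are chosen by a hypothetical per-token importance score. The caption itself gives only the parameter values, so the helper names and the top-scoring selection rule are illustrative.

```python
def causal_mask(n):
    """Full causal attention: query i attends to every key j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def sliding_window_mask(n, w):
    """Sliding window attention: query i attends to keys i-w+1 .. i."""
    return [[0 <= i - j < w for j in range(n)] for i in range(n)]


def select_tokens(scores, capacity):
    """Indices of the highest-scoring tokens, limited to a `capacity` fraction.

    `scores` is a hypothetical importance score per token (assumption).
    """
    n = len(scores)
    k = max(1, round(capacity * n))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def selective_mask(n, w, selected):
    """Assumed selective pattern: only selected queries attend, within window w."""
    chosen = set(selected)
    return [[(i in chosen) and (0 <= i - j < w) for j in range(n)] for i in range(n)]


def render(mask):
    """Draw a mask as text: '#' for attention, '.' for none."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in mask)


# Hypothetical scores for a 10-token sequence.
scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.0, 0.4, 0.5, 0.6]
selected = select_tokens(scores, capacity=0.3)
print(render(selective_mask(len(scores), w=5, selected=selected)))
```

With these example scores, only three of the ten rows contain any attention entries, and each of those rows spans at most five columns, so the number of attended pairs is bounded by the product of the budget and the window rather than growing quadratically with sequence length.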