Hydra: Bidirectional State Space Models Through Generalized Matrix Mixers

Sukjun Hwang, Aakash Lahoti, Tri Dao, Albert Gu

TL;DR

The paper introduces a matrix mixer framework that unifies sequence models (including Transformers and structured state space models such as Mamba) as linear mixers over the input sequence. It identifies sequence alignment, formalized through sequence aligned matrices (SAM), as a key axis that enables data-dependent parameterization and extendability, and uses this insight to develop Hydra, a bidirectional extension of Mamba parameterized as a quasiseparable (QS) matrix mixer with sub-quadratic computation. Hydra reaches $84.3\%$ average accuracy on GLUE and $81.0\%$ Top-1 accuracy on ImageNet-1K, outperforming BERT and ViT on their respective tasks. The work provides a systematic framework for designing new sequence mixers, demonstrates the expressivity and efficiency benefits of SAM and QS parameterizations, and releases code and pretrained weights for reproducibility and broader use.

Abstract

A wide array of sequence models are built on a framework modeled after Transformers, comprising alternating sequence mixer and channel mixer layers. This paper studies a unifying matrix mixer view of sequence mixers that can be conceptualized as a linear map on the input sequence. This framework encompasses a broad range of well-known sequence models, including the self-attention of Transformers as well as recent strong alternatives such as structured state space models (SSMs), and allows understanding downstream characteristics such as efficiency and expressivity through properties of their structured matrix class. We identify a key axis of matrix parameterizations termed sequence alignment, which increases the flexibility and performance of matrix mixers, providing insights into the strong performance of Transformers and recent SSMs such as Mamba. Furthermore, the matrix mixer framework offers a systematic approach to developing sequence mixers with desired properties, allowing us to develop several new sub-quadratic sequence models. In particular, we propose a natural bidirectional extension of the Mamba model (Hydra), parameterized as a quasiseparable matrix mixer, which demonstrates superior performance over other sequence models including Transformers on non-causal tasks. As a drop-in replacement for attention layers, Hydra outperforms BERT by 0.8 points on the GLUE benchmark and ViT by 2% Top-1 accuracy on ImageNet.
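
To make the matrix mixer view concrete, the following is a minimal NumPy sketch, not the authors' implementation. It treats a sequence mixer as a map $Y = MX$ with an $L \times L$ matrix $M$, and builds two such matrices: a dense softmax-attention mixer and the lower-triangular mixer of a scalar-state SSM. The function names (`attention_mixer`, `causal_ssm_mixer`, `ssm_recurrence`), the single-head setup without value projections, and the scalar state are all simplifying assumptions made for illustration.

```python
import numpy as np

def softmax_rows(scores):
    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def attention_mixer(X, Wq, Wk):
    """Dense (L, L) mixer M = softmax(Q K^T / sqrt(d)); single head, illustrative."""
    Q, K = X @ Wq, X @ Wk
    return softmax_rows(Q @ K.T / np.sqrt(Q.shape[-1]))

def causal_ssm_mixer(a, b, c):
    """Lower-triangular mixer of a scalar-state SSM (an assumed toy setup):
    M[i, j] = c[i] * a[j+1] * ... * a[i] * b[j] for i >= j, and 0 otherwise."""
    L = len(a)
    M = np.zeros((L, L))
    for i in range(L):
        for j in range(i + 1):
            M[i, j] = c[i] * np.prod(a[j + 1:i + 1]) * b[j]
    return M

def ssm_recurrence(a, b, c, X):
    """The same SSM run as a recurrence: h_t = a_t h_{t-1} + b_t x_t, y_t = c_t h_t."""
    h = np.zeros(X.shape[1])
    Y = np.zeros_like(X)
    for t in range(len(a)):
        h = a[t] * h + b[t] * X[t]
        Y[t] = c[t] * h
    return Y

rng = np.random.default_rng(0)
L, d = 6, 4
X = rng.normal(size=(L, d))

M_attn = attention_mixer(X, rng.normal(size=(d, d)), rng.normal(size=(d, d)))
a, b, c = rng.uniform(0.5, 1.0, L), rng.normal(size=L), rng.normal(size=L)
M_ssm = causal_ssm_mixer(a, b, c)

# Both models mix the sequence with one matrix multiplication along the length axis.
Y_attn = M_attn @ X
Y_ssm = M_ssm @ X
```

In this sketch the two models differ only in the structure of $M$: the attention matrix is dense, while the SSM matrix is lower triangular and reproduces the output of `ssm_recurrence`. That structural difference is what the framework uses to reason about efficiency and expressivity.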

Paper Structure

This paper contains 45 sections, 10 theorems, 15 equations, 12 figures, 9 tables.

Key Result

Proposition 2.3

Sequence aligned matrices exhibit canonical data-dependent parameterization.

Figures (12)

  • Figure 1: (Left) A schematic of the matrix mixer framework. (Right) An overview of matrix mixer classes: dense, Vandermonde, Toeplitz, low-rank, semiseparable, and quasiseparable.
  • Figure 2: (a) A semiseparable (SS) matrix. (b) A quasiseparable (QS) matrix. (c) A mixer matrix of addition-based bidirectional SSMs. (d) A QS mixer matrix for Hydra. SS and QS matrices are characterized by rank conditions (see the definitions of semiseparable and quasiseparable matrices). The rank characterization of SS matrices includes the diagonal (e.g., green submatrices), whereas that of QS matrices holds only for off-diagonal submatrices (e.g., yellow submatrices). Because of these similar rank properties, a naive addition-based bidirectional SSM is provably a QS matrix mixer. Hence, QS matrix mixers generalize this common heuristic for bidirectional SSMs. The freedom in the diagonal values of Hydra leads to higher expressivity than the mixer matrices of addition-based bidirectional SSMs, where the diagonal values are constrained by the colored vectors.
  • Figure 3: Cross-entropy loss of various bidirectional variants, measured on the C4 validation set.
  • Figure 4: Detailed illustration of Hydra.
  • Figure 5: Pseudocode for Hydra. $B, L, H, P$ denote batch size, sequence length, number of heads, and head dimension, respectively. The suffixes _f and _b denote forward and backward. shift denotes a right shift.
  • ...and 7 more figures
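
The rank claims in the Figure 2 caption can be checked numerically. The sketch below is an illustration under stated assumptions rather than the paper's code: it uses scalar-state SSMs (so every rank bound is 1), and it assumes the quasiseparable mixer is composed as a right-shifted forward SS branch, a flipped and right-shifted backward SS branch, and a free diagonal, which is one reading of the shift and forward/backward notation in the Figure 5 caption. The helper names (`ss_matrix`, `max_offdiag_rank`) and the tolerance `1e-9` are hypothetical choices.

```python
import numpy as np

def ss_matrix(a, b, c):
    """Lower-triangular 1-semiseparable matrix of a scalar-state SSM:
    M[i, j] = c[i] * a[j+1] * ... * a[i] * b[j] for i >= j."""
    L = len(a)
    M = np.zeros((L, L))
    for i in range(L):
        for j in range(i + 1):
            M[i, j] = c[i] * np.prod(a[j + 1:i + 1]) * b[j]
    return M

def shift(Y):
    """Right shift along the sequence axis: row t moves to row t + 1, row 0 becomes zero."""
    out = np.zeros_like(Y)
    out[1:] = Y[:-1]
    return out

def flip(Y):
    """Reverse the sequence axis."""
    return Y[::-1]

def max_offdiag_rank(M, tol=1e-9):
    """Largest rank among blocks lying strictly below or strictly above the diagonal."""
    L = M.shape[0]
    ranks = []
    for k in range(1, L):
        ranks.append(np.linalg.matrix_rank(M[k:, :k], tol=tol))
        ranks.append(np.linalg.matrix_rank(M[:k, k:], tol=tol))
    return max(ranks)

rng = np.random.default_rng(0)
L, d_model = 8, 3
a_f, b_f, c_f = rng.uniform(0.5, 1.0, L), rng.normal(size=L), rng.normal(size=L)
a_b, b_b, c_b = rng.uniform(0.5, 1.0, L), rng.normal(size=L), rng.normal(size=L)
M_f, M_b = ss_matrix(a_f, b_f, c_f), ss_matrix(a_b, b_b, c_b)
diag = rng.normal(size=L)  # free diagonal parameters (illustrative)

# Addition-based bidirectional mixer: forward SS plus the backward SS flipped on both axes.
M_add = M_f + M_b[::-1, ::-1]

# Assumed QS composition: shifted forward branch, flipped shifted backward branch, free diagonal.
M_qs = shift(M_f) + shift(M_b)[::-1, ::-1] + np.diag(diag)

# The same QS mixer applied as sequence operations, without materializing M_qs.
X = rng.normal(size=(L, d_model))
Y_ops = shift(M_f @ X) + flip(shift(M_b @ flip(X))) + diag[:, None] * X

print(max_offdiag_rank(M_add), max_offdiag_rank(M_qs))
```

Under these assumptions both mixers have off-diagonal blocks of rank at most 1, so both are quasiseparable. They differ on the diagonal: in `M_add` it is fixed by the SSM parameters as `c_f * b_f + (c_b * b_b)[::-1]`, while in `M_qs` it equals the independent vector `diag`.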

Theorems & Definitions (21)

  • Definition 2.1: The matrix mixer framework
  • Definition 2.2: Sequence Aligned Matrices
  • Proposition 2.3: Data Dependency
  • proof
  • Proposition 2.4: Extendability
  • proof
  • Proposition 2.5
  • Proposition 2.6
  • Proposition 2.7
  • Definition 3.1: The rank characterization of semiseparable matrices
  • ...and 11 more