The Hidden Attention of Mamba Models

Ameen Ali, Itamar Zimerman, Lior Wolf

TL;DR

The paper tackles understanding how Mamba selective SSMs process dependencies and how their information flow compares to self-attention. It reformulates S6 as a data-controlled linear operator, derives hidden attention matrices, and repurposes transformer explainability tools for Mamba. It shows Mamba yields many more attention matrices per layer and develops Mamba-Attr and Attention Rollout variants with competitive explainability metrics on vision and NLP. Visualization and analysis demonstrate parallels and differences with transformers and highlight potential for weakly supervised tasks and debugging.

Abstract

The Mamba layer offers an efficient selective state space model (SSM) that is highly effective in modeling multiple domains, including NLP, long-range sequence processing, and computer vision. Selective SSMs are viewed as dual models, in which one trains in parallel on the entire sequence via an IO-aware parallel scan, and deploys in an autoregressive manner. We add a third view and show that such models can be viewed as attention-driven models. This new perspective enables us to empirically and theoretically compare the underlying mechanisms to that of the self-attention layers in transformers and allows us to peer inside the inner workings of the Mamba model with explainability methods. Our code is publicly available.
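The attention view mentioned in the abstract can be illustrated by unrolling the selective SSM recurrence. The sketch below is a minimal NumPy illustration, not the authors' released code. It assumes a single channel, a diagonal state matrix, and already-discretized per-step parameters; the names `A_bar`, `B_bar`, `C`, `selective_scan`, and `hidden_attention` are hypothetical. Under those assumptions, the recurrence h_t = A_bar_t * h_{t-1} + B_bar_t * x_t, y_t = C_t . h_t unrolls to y = alpha @ x, where alpha is a lower-triangular, data-dependent matrix.

```python
import numpy as np


def selective_scan(A_bar, B_bar, C, x):
    """Recurrent view for one channel (assumed diagonal state matrix).

    A_bar, B_bar, C: arrays of shape (L, N), one row per time step.
    x: input sequence of shape (L,).
    """
    L, N = A_bar.shape
    h = np.zeros(N)
    y = np.zeros(L)
    for t in range(L):
        h = A_bar[t] * h + B_bar[t] * x[t]  # time-variant state update
        y[t] = C[t] @ h                     # readout
    return y


def hidden_attention(A_bar, B_bar, C):
    """Attention view: alpha[i, j] = C_i . (prod_{k=j+1..i} A_bar_k * B_bar_j).

    Entries with j > i stay zero, so the matrix is causal.
    """
    L, N = A_bar.shape
    alpha = np.zeros((L, L))
    for i in range(L):
        decay = np.ones(N)  # empty product when j == i
        for j in range(i, -1, -1):
            alpha[i, j] = C[i] @ (decay * B_bar[j])
            decay = decay * A_bar[j]  # extend the product down to k = j
    return alpha


# Illustrative random parameters (hypothetical sizes: L = 6 steps, N = 4 states).
rng = np.random.default_rng(0)
L, N = 6, 4
A_bar = rng.uniform(0.5, 1.0, size=(L, N))
B_bar = rng.normal(size=(L, N))
C = rng.normal(size=(L, N))
x = rng.normal(size=L)

alpha = hidden_attention(A_bar, B_bar, C)
y_recurrent = selective_scan(A_bar, B_bar, C, x)
y_attention = alpha @ x
```

Here `y_recurrent` and `y_attention` should agree up to floating-point error, which is the sense in which the layer can be read as a causal attention operator. In a full Mamba layer the per-step parameters are themselves computed from the input, and each channel would yield its own matrix of this kind.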

Paper Structure

This paper contains 17 sections, 7 theorems, 36 equations, 7 figures, 1 table.

Key Result

Theorem

(i) S4 [gu2021efficiently], DSS [dss], and S5 [smith2022simplified] have fixed mixing elements. (ii) GSS [gss] and Hyena [poli2023hyena] have fixed mixing elements with a diagonal data-control mechanism. (iii) Selective SSMs have data-controlled non-diagonal mixers.

Figures (7)

  • Figure 1: Three Perspectives of the Selective State-Space Layer: (Left) Selective State-Space Models (SSMs) can be efficiently computed with linear complexity using parallel scans, allowing for effective parallelization on modern hardware, such as GPUs. (Middle) Similar to SSMs, the selective state-space layer can be computed via a time-variant recurrent rule. (Right) A new view of the selective SSM layer, showing that it uses attention similarly to transformers (see Eq. \ref{eq:MAMbaASmatmul}). Our view enables the generation of attention maps, offering valuable applications in areas such as XAI.
  • Figure 2: Comparative Visualization of Transformer-Attribution and our Mamba-Attribution, both class-specific methods.
  • Figure 3: Average attention maps for CLS token in the middle (a,b,c) and as the first (d,e,f).
  • Figure 4: Hidden Attention Matrices: Attention matrices in vision and NLP models. Each row represents a different layer within the models, showcasing the evolution of the attention matrices at 25% (top), 50%, and 75% (bottom) of the layer depth.
  • Figure 5: Qualitative results for the different explanation methods for the ViT-small and the Mamba-small models. (a) the original image, (b) the aggregated Raw-Attention of ViT-Small, (c) Attention Rollout for ViT-Small, (d) Transformer-Attribution for ViT-Small, (e) the Raw-Attention of Mamba-Small, (f) Attention-Rollout of Mamba-Small and (g) the Mamba-Attribution method for the Mamba-Small model.
  • ...and 2 more figures

Theorems & Definitions (13)

  • Theorem
  • Theorem
  • Theorem
  • Proof
  • Definition
  • Lemma
  • Proof
  • Lemma
  • Proof
  • Lemma
  • ...and 3 more