The Hidden Attention of Mamba Models
Ameen Ali, Itamar Zimerman, Lior Wolf
TL;DR
This paper examines how Mamba's selective SSMs capture dependencies between tokens and how their information flow compares with self-attention. It reformulates the S6 layer as a data-controlled linear operator, derives hidden attention matrices from that form, and adapts transformer explainability tools to Mamba. It shows that Mamba yields many more attention matrices per layer, and it develops Mamba-Attr and Attention Rollout variants that achieve competitive explainability metrics on vision and NLP tasks. Visualizations and analysis highlight parallels and differences with transformers and point to potential uses in weakly supervised tasks and debugging.
Abstract
The Mamba layer offers an efficient selective state space model (SSM) that is highly effective in modeling multiple domains, including NLP, long-range sequence processing, and computer vision. Selective SSMs are viewed as dual models, in which one trains in parallel on the entire sequence via an IO-aware parallel scan and deploys in an autoregressive manner. We add a third view and show that such models can be viewed as attention-driven models. This new perspective enables us to compare, both empirically and theoretically, the underlying mechanisms to those of the self-attention layers in transformers, and allows us to peer inside the inner workings of the Mamba model with explainability methods. Our code is publicly available.
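To make the attention view concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single channel with a diagonal, time-varying recurrence h_t = A_bar[t] * h_{t-1} + B_bar[t] * x[t], y_t = C[t] . h_t, and unrolls it into a lower-triangular matrix alpha with y = alpha @ x. The function names, the shapes, and the random parameters are illustrative assumptions; in Mamba these parameters are computed from the input, and the discretization and gating steps are omitted here.

```python
import numpy as np


def recurrent_scan(A_bar, B_bar, C, x):
    """Sequential view: h_t = A_bar[t] * h_{t-1} + B_bar[t] * x[t], y_t = C[t] . h_t.

    A_bar, B_bar, C have shape (L, N); x has shape (L,). Single channel only.
    """
    L, N = A_bar.shape
    h = np.zeros(N)
    y = np.zeros(L)
    for t in range(L):
        h = A_bar[t] * h + B_bar[t] * x[t]
        y[t] = C[t] @ h
    return y


def hidden_attention(A_bar, B_bar, C):
    """Matrix view: alpha[i, j] = C[i] . (prod_{k=j+1..i} A_bar[k]) * B_bar[j] for j <= i."""
    L, N = A_bar.shape
    alpha = np.zeros((L, L))
    for i in range(L):
        decay = np.ones(N)  # empty product when j == i
        for j in range(i, -1, -1):
            alpha[i, j] = C[i] @ (decay * B_bar[j])
            decay = decay * A_bar[j]  # now holds prod_{k=j..i} A_bar[k]
    return alpha


# Illustrative random parameters (an assumption; Mamba derives them from the input).
rng = np.random.default_rng(0)
L, N = 6, 4
A_bar = rng.uniform(0.5, 1.0, size=(L, N))
B_bar = rng.normal(size=(L, N))
C = rng.normal(size=(L, N))
x = rng.normal(size=L)

alpha = hidden_attention(A_bar, B_bar, C)
# The matrix view and the recurrent view should produce the same outputs.
assert np.allclose(alpha @ x, recurrent_scan(A_bar, B_bar, C, x))
```

Each row i of alpha weights the inputs at positions j <= i, which is the sense in which the recurrence can be read as a causal attention matrix. Because this sketch builds one such matrix per channel, a layer with many channels would yield many matrices under the same construction.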
