Explaining Modern Gated-Linear RNNs via a Unified Implicit Attention Formulation

Itamar Zimerman, Ameen Ali, Lior Wolf

TL;DR

A unified view of attention-free layers, such as Mamba, RWKV, and gated RNNs, that formulates them as implicit causal self-attention layers. The resulting attention matrices and attribution method outperform an alternative, more limited formulation that was recently proposed for Mamba.

Abstract

Recent advances in efficient sequence modeling have led to attention-free layers, such as Mamba, RWKV, and various gated RNNs, all featuring sub-quadratic complexity in sequence length and excellent scaling properties, enabling the construction of a new type of foundation model. In this paper, we present a unified view of these models, formulating such layers as implicit causal self-attention layers. The formulation includes most of their sub-components and is not limited to a specific part of the architecture. The framework compares the underlying mechanisms of different layers on similar grounds and provides a direct means for applying explainability methods. Our experiments show that our attention matrices and attribution method outperform an alternative, more limited formulation that was recently proposed for Mamba. For the other architectures, for which our method is the first to provide such a view, it is effective and competitive in the relevant metrics compared to the results obtained by state-of-the-art Transformer explainability methods. Our code is publicly available.
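As an illustration of what "implicit causal self-attention" can mean, the following minimal sketch unrolls a scalar gated linear recurrence into a lower-triangular matrix acting on the input sequence. It is a simplified toy under stated assumptions, not the paper's formulation: the recurrence h_t = a_t h_{t-1} + b_t x_t, y_t = c_t h_t, the scalar state, and the function names `gated_linear_rnn` and `implicit_attention` are all chosen here for illustration, and the sub-components the paper folds into its matrices (for example convolutions, gating branches, and normalization) are omitted.

```python
import numpy as np


def gated_linear_rnn(x, a, b, c):
    """Run h_t = a_t * h_{t-1} + b_t * x_t and y_t = c_t * h_t step by step."""
    h = 0.0
    y = np.empty(len(x), dtype=float)
    for t in range(len(x)):
        h = a[t] * h + b[t] * x[t]
        y[t] = c[t] * h
    return y


def implicit_attention(a, b, c):
    """Build A with A[i, j] = c_i * (a_{j+1} * ... * a_i) * b_j for j <= i.

    Entries above the diagonal stay zero, so the matrix is causal.
    """
    n = len(a)
    A = np.zeros((n, n))
    for i in range(n):
        decay = 1.0  # empty product when j == i
        for j in range(i, -1, -1):
            A[i, j] = c[i] * decay * b[j]
            decay *= a[j]  # extend the product of gates down to a_j
    return A


# Toy data (hypothetical values, chosen only to exercise the code).
rng = np.random.default_rng(0)
n = 6
x = rng.normal(size=n)
a = rng.uniform(0.5, 1.0, size=n)  # input-dependent gates in a real model
b = rng.normal(size=n)
c = rng.normal(size=n)

A = implicit_attention(a, b, c)
print(np.allclose(gated_linear_rnn(x, a, b, c), A @ x))
```

Row i of `A` plays the role of an attention row: it states how much each earlier position j contributes to output i, which is the kind of matrix that attention-based explainability methods operate on.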

Paper Structure (17 sections, 34 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Unified and Interpretable Formulation of Attention-Free Architectures via Attention Matrices: (Left) Schematic overview of the architectures of Mamba, Griffin, and RWKV. (Right) A new view of those layers that rely on implicit attention. Our perspective enables the generation of attention maps, offering valuable applications in areas such as Explainable AI.
  • Figure 2: Hidden Attention Matrices: Attention matrices of LLMs. Each row represents a different layer within the models, showcasing the evolution of the attention matrices at 25% (top), 50%, and 75% (bottom) of the layer depth.
  • Figure 3: Qualitative results for the different explanation methods for the ViT and ViM, both of small size. (a) The original image, (b) Raw-Attention over ViM, (c) Attention-Rollout over ViM, (d) Mamba-Attribution over ViM, (e) Raw-Attention with our proposed attention over ViM, (f) Attention-Rollout with our proposed attention over ViM, (g) Mamba-Attribution with our proposed attention over ViM, (h) Raw-Attention of ViT, (i) Attention-Rollout for ViT, (j) Transformer-Attribution for ViT. Results for columns (b), (c), and (d) are based on the method of ali2024hidden, and the ViT results in (h), (i), and (j) rely on chefer2021transformer.
  • Figure 4: Qualitative results for NLP, with samples taken from IMDB movie sentiment classification. Panel (a) shows the results for the previously proposed Mamba attention of ali2024hidden, (b) our proposed Mamba attention, and (c) our proposed method over RWKV. The upper row shows a negative-sentiment sample and the lower row a positive-sentiment sample.
  • Figure 5: Comparative visualization of ablated hidden matrices. 'M' for Mamba.
  • ...and 1 more figure