2Mamba2Furious: Linear in Complexity, Competitive in Accuracy

Gabriel Mongaras; Eric C. Larson

TL;DR

We propose a method that is nearly as accurate as softmax attention yet much more memory efficient at long context lengths, and we investigate which elements of Mamba-2 help it surpass softmax attention accuracy.

Abstract

Linear attention transformers have become a strong alternative to softmax attention due to their efficiency. However, linear attention tends to be less expressive, which reduces its accuracy relative to softmax attention. To bridge this accuracy gap, we manipulate Mamba-2, a very strong linear attention variant. We first simplify Mamba-2 down to its most fundamental and important components, evaluating which specific choices make it most accurate. Starting from this simplified Mamba variant (Mamba-2S), we improve the A-mask and increase the order of the hidden state, resulting in a method, which we call 2Mamba, that is nearly as accurate as softmax attention yet much more memory efficient at long context lengths. We also investigate which elements of Mamba-2 help surpass softmax attention accuracy. Code is provided for all our experiments.
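
As a rough illustration of why linear attention is memory efficient, and of what raising the order of the hidden state can mean, the sketch below compares a quadratic-time causal linear attention with its recurrent form, then repeats the comparison with squared query-key scores. It is not the authors' 2Mamba implementation: it omits normalization, the A-mask, and every other Mamba-2 component, and the names `linear_attention_quadratic`, `linear_attention_recurrent`, and `second_order` are hypothetical helpers introduced only for this example. It relies on the standard identity (q·k)^2 = <q⊗q, k⊗k>.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def linear_attention_quadratic(Q, K, V, power=1):
    """Causal, unnormalized attention computed the naive way.

    Each output row t is sum_{s <= t} (q_t . k_s) ** power * v_s,
    which costs O(T^2) in the sequence length T.
    """
    d_v = len(V[0])
    out = []
    for t, q in enumerate(Q):
        weights = [dot(q, K[s]) ** power for s in range(t + 1)]
        out.append([sum(w * V[s][j] for s, w in enumerate(weights))
                    for j in range(d_v)])
    return out


def linear_attention_recurrent(Q, K, V):
    """The same computation as an RNN with a fixed-size hidden state.

    The state S has shape (d_k, d_v) and does not grow with T.
    """
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q, k, v in zip(Q, K, V):
        # State update: S <- S + k v^T
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] += k[i] * v[j]
        # Readout: o = q^T S
        out.append([sum(q[i] * S[i][j] for i in range(d_k))
                    for j in range(d_v)])
    return out


def second_order(x):
    """Flattened outer product x (x) x, so that
    dot(second_order(q), second_order(k)) == dot(q, k) ** 2."""
    return [a * b for a in x for b in x]


# Tiny hand-written inputs (T = 3, d_k = 2, d_v = 2), for illustration only.
Q = [[0.1, -0.2], [0.3, 0.5], [-0.4, 0.2]]
K = [[0.7, 0.1], [-0.3, 0.6], [0.2, 0.9]]
V = [[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3]]

first_naive = linear_attention_quadratic(Q, K, V)
first_rnn = linear_attention_recurrent(Q, K, V)

# Squared scores: the recurrent form still works, but the hidden state
# grows from d_k x d_v to d_k^2 x d_v entries.
second_naive = linear_attention_quadratic(Q, K, V, power=2)
second_rnn = linear_attention_recurrent(
    [second_order(q) for q in Q], [second_order(k) for k in K], V
)
```

In this sketch the recurrent forms reproduce the quadratic ones while keeping a state whose size is independent of sequence length, which is the source of the memory savings at long context. The second-order variant trades a larger (but still constant-size) state for a more expressive score function.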

Paper Structure (22 sections, 12 equations, 11 figures, 3 tables, 5 algorithms)

Introduction
Background
Softmax Attention
Linear attention (naive computation)
Mamba-2
Softmax as an RNN
Isolating Mamba-2 Accuracy Gains
Building Up to the Mamba-2S Base Model
Mamba-2 with a Squared Hidden State
2Mamba Algorithm Efficiency
Effective Context Usage
2Mamba With an Exponentiated Hidden State
Conclusion and Future Work
Model Ablation Details
Gradients
...and 7 more sections

Figures (11)

Figure 1: Accuracy of linear attention, Mamba, and softmax attention, keeping everything but the attention mechanism constant across experiments.
Figure 2: Accuracy of various norm types. Softmax normalization requires a positive inner-product space image; as such, we use ReLU.
Figure 3: Isolated Mamba ablation
Figure 4: Investigations of major and minor build-ups in constructing the simplified Mamba-2 architecture.
Figure 5: Experimental results comparing accuracy and training stability.
...and 6 more figures
