Revealing and Mitigating the Local Pattern Shortcuts of Mamba

Wangjie You, Zecheng Tang, Juntao Li, Lili Yao, Min Zhang

TL;DR

This work introduces a global selection module into the Mamba model to address Mamba's reliance on local pattern shortcuts, which enable the model to remember local key information within its limited memory but hinder its ability to retain more dispersed information.

Abstract

Large language models (LLMs) have advanced significantly due to the attention mechanism, but their quadratic complexity and linear memory demands limit their performance on long-context tasks. Recently, researchers introduced Mamba, an advanced model built upon State Space Models (SSMs) that offers linear complexity and constant memory. Although Mamba is reported to match or surpass the performance of attention-based models, our analysis reveals a performance gap: Mamba excels in tasks that involve localized key information but faces challenges with tasks that require handling distributed key information. Our controlled experiments suggest that this inconsistency arises from Mamba's reliance on local pattern shortcuts, which enable the model to remember local key information within its limited memory but hinder its ability to retain more dispersed information. Therefore, we introduce a global selection module into the Mamba model to address this issue. Experiments on both existing and proposed synthetic tasks, as well as real-world tasks, demonstrate the effectiveness of our method. Notably, with the introduction of only 4M extra parameters, our approach enables the Mamba model (130M) to achieve a significant improvement on tasks with distributed information, increasing its performance from 0 to 80.54 points.

Paper Structure

This paper contains 46 sections, 7 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Mamba exhibits two distinct trends under different settings. The y-axis represents accuracy. In Fig. (a), the x-axis shows the number of key-value pairs in the context, with a testing length of 4K. In Fig. (b), the x-axis represents the testing length.
  • Figure 2: Illustration of the Mqar task.
  • Figure 3: Mqar task with different positional patterns.
  • Figure 4: The attention-like matrices of Mamba-130M, which is trained on the standard Mqar task and tested on all three testing sets, i.e., Standard, Last, and Shuffle. We plot the results for the 22nd layer of the model. Lighter colors indicate higher attention scores at specific positions. The red dashed line marks the location of the key-value pairs, while the yellow dashed line indicates where the model attends most.
  • Figure 5: Mqar task with different n-gram patterns.
  • ...and 6 more figures