Visual Attention Exploration in Vision-Based Mamba Models

Junpeng Wang, Chin-Chia Michael Yeh, Uday Singh Saini, Mahashweta Das

TL;DR

The paper addresses the challenge of interpreting attention in vision-based Mamba models by introducing a dedicated visual analytics tool that analyzes inter-block and intra-block attention across a VMamba architecture trained on ImageNet. By extracting attention matrices per block and stage and applying dimensionality reduction, the approach reveals distinct block-level patterns, patch-level locality, and the influence of patch ordering on attention. Key contributions include a dual-view system (Scatterplot and Patch views) with modes tailored to surface global block differences and local patch relations, as well as the exploration of alternative patch orders (diagonal, Morton, spiral) that preserve spatial locality and yield comparable accuracy. The findings enhance understanding of how VMamba distributes attention across patches, inform patch-order design, and provide a framework for diagnosing and improving vision-based SSMs in latency-critical settings.

Abstract

State space models (SSMs) have emerged as an efficient alternative to transformer-based models, offering complexity that is linear in sequence length and therefore scales better than transformers. One of the latest advances in SSMs, Mamba, introduces a selective scan mechanism that assigns trainable weights to input tokens, effectively mimicking the attention mechanism. Mamba has also been successfully extended to the vision domain by decomposing 2D images into smaller patches and arranging them as 1D sequences. However, it remains unclear how these patches interact with (or attend to) each other in relation to their original 2D spatial locations. In addition, the order used to arrange the patches into a sequence significantly impacts their attention distribution. To better understand the attention between patches and explore the attention patterns, we introduce a visual analytics tool specifically designed for vision-based Mamba models. This tool enables a deeper understanding of how attention is distributed across patches in different Mamba blocks and how it evolves throughout a Mamba model. Using the tool, we also investigate the impact of different patch-ordering strategies on the learned attention, offering further insights into the model's behavior.
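
The patch orders named in the TL;DR (diagonal, Morton, spiral) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the traversal conventions (for example, a clockwise spiral starting from the top-left corner), and the `mean_step` locality measure are all assumptions made for illustration. Each function returns the (row, column) grid positions of the patches in the order they would be placed into the 1D sequence.

```python
# Illustrative patch orderings for an h x w grid of patches.
# All names and conventions here are assumptions, not the paper's code.

def row_major_order(h, w):
    """Baseline: scan rows left to right, top to bottom."""
    return [(r, c) for r in range(h) for c in range(w)]

def diagonal_order(h, w):
    """Visit anti-diagonals (r + c constant) one after another."""
    cells = row_major_order(h, w)
    return sorted(cells, key=lambda rc: (rc[0] + rc[1], rc[0]))

def morton_order(h, w):
    """Z-order: sort cells by the bit-interleaved (row, column) code."""
    n_bits = max(h, w).bit_length()

    def code(rc):
        r, c = rc
        z = 0
        for bit in range(n_bits):
            z |= ((c >> bit) & 1) << (2 * bit)      # column bits: even slots
            z |= ((r >> bit) & 1) << (2 * bit + 1)  # row bits: odd slots
        return z

    return sorted(row_major_order(h, w), key=code)

def spiral_order(h, w):
    """Clockwise spiral from the top-left corner towards the center."""
    top, bottom, left, right = 0, h - 1, 0, w - 1
    order = []
    while top <= bottom and left <= right:
        order += [(top, c) for c in range(left, right + 1)]
        order += [(r, right) for r in range(top + 1, bottom + 1)]
        if top < bottom:
            order += [(bottom, c) for c in range(right - 1, left - 1, -1)]
        if left < right:
            order += [(r, left) for r in range(bottom - 1, top, -1)]
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return order

def mean_step(order):
    """Hypothetical locality measure: mean Manhattan distance between
    patches that are adjacent in the 1D sequence (lower = more local)."""
    steps = [abs(r1 - r0) + abs(c1 - c0)
             for (r0, c0), (r1, c1) in zip(order, order[1:])]
    return sum(steps) / len(steps)

if __name__ == "__main__":
    for name, fn in [("row-major", row_major_order),
                     ("diagonal", diagonal_order),
                     ("morton", morton_order),
                     ("spiral", spiral_order)]:
        print(name, round(mean_step(fn(4, 4)), 3))
```

Under this measure, a spiral always moves to a grid neighbor, whereas a row-major scan jumps across the full image width at the end of every row. This is one simple way to quantify what "preserving spatial locality" could mean for a 1D patch sequence.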

Paper Structure

This paper contains 20 sections, 7 figures, 2 algorithms.

Figures (7)

  • Figure 1: The four-way cross-scan in VMamba [liu2024vmamba].
  • Figure 2: The architecture of VMamba for image classification [liu2024vmamba].
  • Figure 3: Attention pattern similarity between blocks from the four stages of the VMamba model in Fig. 2.
  • Figure 4: Patches in the same column exhibit similar attention.
  • Figure 5: The attention pattern similarity between patches from blocks of different stages. (a1-j1) Two, two, four, and two blocks from stages 0, 1, 2, and 3 are shown, respectively. (a2-j2) Selecting a patch to inspect its attention pattern.
  • ...and 2 more figures