Selective Visual Prompting in Vision Mamba

Yifeng Yao, Zichen Liu, Zhenyu Cui, Yuxin Peng, Jiahuan Zhou

TL;DR

This work tackles the inefficiency of fine-tuning pre-trained Vision Mamba (Vim) models for downstream tasks by introducing Selective Visual Prompting (SVP). SVP generates input-dependent, token-level prompts through a dual-path framework, Cross-Prompting and Inner-Prompting, that adaptively activates Vim's update and forget gates to propagate discriminative information across the sequence. The method keeps the Vim encoder frozen and trains only the lightweight modules (G^C, G^I, alpha, beta) and the classifier, achieving state-of-the-art results on HTA and VTAB-1K with a small parameter footprint. Ablations and visualizations show enhanced gate activation and information propagation, and the code is released for reproducibility.

Abstract

Pre-trained Vision Mamba (Vim) models have demonstrated exceptional performance across various computer vision tasks in a computationally efficient manner, attributed to their unique design of selective state space models. To further extend their applicability to diverse downstream vision tasks, Vim models can be adapted using the efficient fine-tuning technique known as visual prompting. However, existing visual prompting methods are predominantly tailored for Vision Transformer (ViT)-based models that leverage global attention, neglecting the distinctive sequential token-wise compression and propagation characteristics of Vim. Specifically, existing prompt tokens prefixed to the sequence are insufficient to effectively activate the input and forget gates across the entire sequence, hindering the extraction and propagation of discriminative information. To address this limitation, we introduce a novel Selective Visual Prompting (SVP) method specifically for the efficient fine-tuning of Vim. To prevent the loss of discriminative information during state space propagation, SVP employs lightweight selective prompters for token-wise prompt generation, ensuring adaptive activation of the update and forget gates within Mamba blocks to promote discriminative information propagation. Moreover, considering that Vim propagates both shared cross-layer information and specific inner-layer information, we further refine SVP with a dual-path structure: Cross-Prompting and Inner-Prompting. Cross-Prompting utilizes shared parameters across layers, while Inner-Prompting employs distinct parameters, promoting the propagation of both shared and specific information, respectively. Extensive experimental results on various large-scale benchmarks demonstrate that our proposed SVP significantly outperforms state-of-the-art methods. Our code is available at https://github.com/zhoujiahuan1991/AAAI2025-SVP.
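The dual-path idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the bottleneck prompter (down-projection, ReLU, up-projection), the zero initialisation, the hidden sizes, and the names `make_prompter`, `generate_prompt`, and `prompt_layer_input` are all assumptions made for illustration, as is the pairing of `alpha` with the cross path and `beta` with the inner path. What it takes from the text is only the structure: one prompter whose weights are shared across layers, one prompter per layer with its own weights, token-wise prompts computed from the input, and element-wise factors that scale the prompts before they are added to the input.

```python
import numpy as np

def make_prompter(rng, dim, hidden):
    # Hypothetical bottleneck prompter (down-project, ReLU, up-project).
    # The zero-initialised "up" matrix makes every prompt start as a no-op.
    return {
        "down": rng.normal(0.0, 0.02, size=(dim, hidden)),
        "up": np.zeros((hidden, dim)),
    }

def generate_prompt(prompter, tokens):
    # One prompt vector per token, computed from that token alone.
    hidden = np.maximum(tokens @ prompter["down"], 0.0)
    return hidden @ prompter["up"]

def prompt_layer_input(tokens, cross_prompter, inner_prompter, alpha, beta):
    # Superimpose both prompts on the layer input, scaled element-wise.
    # Assumption: alpha scales the cross path, beta scales the inner path.
    p_cross = generate_prompt(cross_prompter, tokens)
    p_inner = generate_prompt(inner_prompter, tokens)
    return tokens + alpha * p_cross + beta * p_inner

rng = np.random.default_rng(0)
num_layers, seq_len, dim = 4, 6, 8

# Cross-Prompting: a single set of weights reused by every layer.
cross = make_prompter(rng, dim, hidden=4)
# Inner-Prompting: a separate set of weights for each layer.
inner = [make_prompter(rng, dim, hidden=2) for _ in range(num_layers)]
alpha = np.ones(dim)
beta = np.ones(dim)

tokens = rng.normal(size=(seq_len, dim))
prompted = prompt_layer_input(tokens, cross, inner[0], alpha, beta)
```

In a real fine-tuning setup only the prompter weights, the two factors, and the classifier would receive gradients, while the encoder that consumes `prompted` stays frozen.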

Paper Structure

This paper contains 19 sections, 12 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Existing visual prompting methods (Jia et al., 2022) use prompt sequences prefixed to the image tokens, which hinder discriminative feature propagation in Vim. In contrast, our SVP uses input-dependent selective prompts that better learn the input distribution, activating update and forget gates to enhance object-aware feature propagation.
  • Figure 2: Our SVP employs a dual-path architecture: the Inner-Prompting pathway prompts specific information at each layer, while the Cross-Prompting pathway prompts shared information across layers. Both the Inner-Prompt $\boldsymbol{p}^{I}_i$ and Cross-Prompt $\boldsymbol{p}^{C}_i$ are selectively generated based on the input. They are subsequently coordinated by two element-wise dynamic factors $\boldsymbol{\alpha}_j, \boldsymbol{\beta}_i$ and then superimposed onto the original input.
  • Figure 3: The internal structure of the Mamba block. The parameters $\Delta_i$, $\mathbf{B}_i$, and $\mathbf{C}_i$ are all input-dependent.
  • Figure 4: Ablation results of the number of shared layers in Cross-Prompting.
  • Figure 5: Ablation of hidden dimension in Inner-Prompting.
  • ...and 2 more figures
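
To make the Figure 3 caption concrete, the following is a simplified sketch of a selective state space recurrence in which $\Delta_i$, $\mathbf{B}_i$, and $\mathbf{C}_i$ are computed from the current token. It is not the Vim or Mamba code: the projection matrices `W_delta`, `W_B`, `W_C`, the function name `selective_scan`, and the shapes are assumptions, and the convolution, gating branch, and bidirectional scan of a real block are omitted. Under a common reading of this recurrence, $\exp(\Delta_i \mathbf{A})$ decides how much of the previous state is kept (a forget-like gate) and $\Delta_i \mathbf{B}_i$ decides how much of the current token is written (an update-like gate), which is why prompts that change the input at every token can influence both.

```python
import numpy as np

def softplus(z):
    # Numerically stable log(1 + exp(z)); keeps the step size positive.
    return np.logaddexp(0.0, z)

def selective_scan(x, A, W_delta, W_B, W_C):
    # x: (seq_len, dim) tokens; A: (dim, state) with negative entries.
    # W_delta: (dim, dim); W_B and W_C: (dim, state). All hypothetical.
    seq_len, dim = x.shape
    state = A.shape[1]
    h = np.zeros((dim, state))
    ys = np.empty((seq_len, dim))
    forget = np.empty((seq_len, dim, state))
    for t in range(seq_len):
        delta = softplus(x[t] @ W_delta)      # input-dependent step size, (dim,)
        B = x[t] @ W_B                        # input-dependent input map, (state,)
        C = x[t] @ W_C                        # input-dependent output map, (state,)
        A_bar = np.exp(delta[:, None] * A)    # fraction of the old state kept
        B_bar = delta[:, None] * B[None, :]   # weight given to the new token
        h = A_bar * h + B_bar * x[t][:, None]
        ys[t] = h @ C
        forget[t] = A_bar
    return ys, forget

rng = np.random.default_rng(0)
seq_len, dim, state = 5, 4, 3
x = rng.normal(size=(seq_len, dim))
A = -np.exp(rng.normal(size=(dim, state)))    # negative, so the state decays
W_delta = rng.normal(size=(dim, dim))
W_B = rng.normal(size=(dim, state))
W_C = rng.normal(size=(dim, state))

ys, forget = selective_scan(x, A, W_delta, W_B, W_C)
```

Because the state is updated strictly left to right, a prompt that only prefixes the sequence enters the state once and is then subject to decay at every later step, whereas a token-wise prompt alters `delta`, `B`, and `C` at each position.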