
On the low-shot transferability of [V]-Mamba

Diganta Misra, Jay Gala, Antonio Orvieto

TL;DR

This work investigates how well [V]-Mamba variants transfer in few-shot settings compared to Vision Transformers (ViTs) across seven downstream datasets. It benchmarks two transfer methods, linear probing (LP) and iterative label mapping-based visual prompting (ILM-VP), using models pretrained on ImageNet-1k and a common training protocol. The results show that [V]-Mamba typically matches or outperforms ViTs under LP but matches or underperforms them under VP, and that the gap between LP and VP performance has a weak positive correlation with [V]-Mamba model scale. These findings shed light on method- and scale-dependent transfer dynamics for Visual Mamba models and can guide future research on efficient adaptation of SSM-based vision architectures.

Abstract

The strength of modern large-scale neural networks lies in their ability to efficiently adapt to new tasks with few examples. Although extensive research has investigated the transferability of Vision Transformers (ViTs) to various downstream tasks under diverse constraints, this study shifts focus to explore the transfer learning potential of [V]-Mamba. We compare its performance with ViTs across different few-shot data budgets and efficient transfer methods. Our analysis yields three key insights into [V]-Mamba's few-shot transfer performance: (a) [V]-Mamba demonstrates superior or equivalent few-shot learning capabilities compared to ViTs when utilizing linear probing (LP) for transfer, (b) Conversely, [V]-Mamba exhibits weaker or similar few-shot learning performance compared to ViTs when employing visual prompting (VP) as the transfer method, and (c) We observe a weak positive correlation between the performance gap in transfer via LP and VP and the scale of the [V]-Mamba model. This preliminary analysis lays the foundation for more comprehensive studies aimed at furthering our understanding of the capabilities of [V]-Mamba variants and their distinctions from ViTs.
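
As a rough illustration of what a "few-shot data budget" means in practice, the sketch below draws a class-balanced $N$-shot subset from a labelled dataset. The function name `sample_n_shot`, the seed handling, and the toy labels are assumptions made for this example; the abstract does not specify how the subsets were sampled.

```python
import random
from collections import defaultdict


def sample_n_shot(labels, n_shot, seed=0):
    """Return sorted indices of a class-balanced subset with n_shot examples per class.

    Hypothetical helper: class-balanced sampling is an assumption, not a
    detail taken from the paper.
    """
    # Group example indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # local RNG so the subset is reproducible
    subset = []
    for label in sorted(by_class):
        candidates = by_class[label]
        if len(candidates) < n_shot:
            raise ValueError(
                f"class {label!r} has only {len(candidates)} examples"
            )
        # Draw n_shot distinct examples of this class.
        subset.extend(rng.sample(candidates, n_shot))
    return sorted(subset)


# Toy label list: 3 classes with 5 examples each (placeholder data, not from the paper).
toy_labels = [0] * 5 + [1] * 5 + [2] * 5
few_shot_idx = sample_n_shot(toy_labels, n_shot=2)
```

The same subset would then be used to train either the linear probe or the visual prompt, so that both transfer methods see an identical data budget.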

Paper Structure

This paper contains 8 sections, 3 figures, and 1 table.

Figures (3)

  • Figure 1: Comparison of label mappings from the source dataset (ImageNet-1k [deng2009imagenet]) between the VSSM-Tiny (VSSM-T) [liu2024vmamba] and VSSM-Small (VSSM-S) [zhu2024vision] models when transferred via ILM-VP [chen2023understanding] for target classes in the CIFAR-10 [krizhevsky2009learning] dataset. VSSM-Tiny produces a more semantically accurate label mapping from the source dataset, while VSSM-Small associates target classes with semantically unrelated source classes. The test accuracy shown at the bottom confirms that VSSM-Tiny outperforms VSSM-Small.
  • Figure 2: Transfer performance, measured by test accuracy, of different models of similar scale across various downstream datasets at different $N$-shot settings, trained using the LP (top) and ILM-VP [chen2023understanding] (middle) methods. $\Delta$ (bottom) denotes the difference in test accuracy between LP and ILM-VP models across datasets and data budgets.
  • Figure 3: Transfer performance gap ($\Delta$), measured by the difference in test accuracy between the LP and ILM-VP [chen2023understanding] methods, for various SSM models on a variety of downstream datasets at different $N$-shot settings.
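
The gap $\Delta$ in Figures 2 and 3 can be sketched as a per-setting subtraction of test accuracies. In the minimal example below, the sign convention (LP minus ILM-VP), the helper names `transfer_gap` and `pearson`, and the choice of Pearson's coefficient as the correlation measure are all assumptions. Every number is a placeholder for illustration, not a result from the paper.

```python
import math


def transfer_gap(lp_acc, vp_acc):
    """Delta per (dataset, n_shot) key, assumed here to be LP minus ILM-VP accuracy."""
    return {key: lp_acc[key] - vp_acc[key] for key in lp_acc if key in vp_acc}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Placeholder accuracies (percent) keyed by (dataset, N-shot); NOT paper results.
lp = {("CIFAR-10", 1): 50.0, ("CIFAR-10", 10): 80.0}
vp = {("CIFAR-10", 1): 30.0, ("CIFAR-10", 10): 45.0}
gaps = transfer_gap(lp, vp)

# Placeholder model scales and mean gaps, used only to show how the helper is called.
model_scale = [1.0, 2.0, 3.0]
mean_gap = [2.0, 4.0, 6.0]
scale_gap_corr = pearson(model_scale, mean_gap)
```

A positive `scale_gap_corr` computed over real measurements would correspond to the reported trend that the LP-VP gap grows with [V]-Mamba model scale.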