ZeroMamba: Exploring Visual State Space Model for Zero-Shot Learning

Wenjin Hou, Dingjie Fu, Kun Li, Shiming Chen, Hehe Fan, Yi Yang

TL;DR

ZeroMamba tackles zero-shot learning by embedding semantic guidance directly into a Vision Mamba backbone. It introduces three modules, Semantic-aware Local Projection (SLP), Global Representation Learning (GRL), and Semantic Fusion (SeF), which learn complementary local and global semantic representations and fuse them for discriminative classification. The model is optimized with a joint loss that combines a semantic constraint with cosine-based matching. Empirically, ZeroMamba achieves state-of-the-art performance on CZSL and GZSL across CUB, SUN, and AWA2, with strong generalization on ImageNet under limited training data, while maintaining a favorable accuracy–parameter trade-off. The work demonstrates that a parameter-efficient backbone with a global receptive field, such as Vision Mamba, can outperform heavier CNN- or ViT-based ZSL pipelines when guided by semantic-aware modules, and that it offers a solid, scalable baseline for future visual-semantic learning.
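
To make the cosine-based matching and the two evaluation settings concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The class names, the three-dimensional attribute vectors, and the helper names `cosine` and `predict` are all hypothetical and chosen only for illustration. The sketch assumes the model has already produced a semantic prediction for an image. CZSL then searches only the unseen classes, while GZSL searches seen and unseen classes together.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict(semantic_pred, class_attributes, candidates):
    """Return the candidate class whose attribute vector best matches the prediction."""
    return max(candidates, key=lambda c: cosine(semantic_pred, class_attributes[c]))


# Hypothetical per-class attribute vectors: [striped, winged, aquatic].
class_attributes = {
    "tiger": [1.0, 0.0, 0.1],    # seen during training
    "eagle": [0.0, 1.0, 0.1],    # seen during training
    "zebra": [0.9, 0.0, 0.3],    # unseen
    "penguin": [0.1, 0.8, 0.9],  # unseen
}
seen = ["tiger", "eagle"]
unseen = ["zebra", "penguin"]

# Hypothetical semantic prediction for one test image.
semantic_pred = [0.2, 0.7, 0.8]

czsl_label = predict(semantic_pred, class_attributes, unseen)         # unseen classes only
gzsl_label = predict(semantic_pred, class_attributes, seen + unseen)  # seen and unseen classes
print(czsl_label, gzsl_label)
```

Because cosine similarity ignores vector magnitude, the match depends only on the direction of the predicted semantic vector, which is one common reason ZSL methods prefer it over a raw dot product.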

Abstract

Zero-shot learning (ZSL) aims to recognize unseen classes by transferring semantic knowledge from seen classes to unseen ones, guided by semantic information. To this end, existing works have demonstrated remarkable performance by utilizing global visual features from Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs) for visual-semantic interactions. However, because CNNs have limited receptive fields and ViTs incur quadratic complexity, these visual backbones achieve suboptimal visual-semantic interactions. In this paper, motivated by the visual state space model (i.e., Vision Mamba), which is capable of capturing long-range dependencies and modeling complex visual dynamics, we propose a parameter-efficient ZSL framework called ZeroMamba to advance ZSL. ZeroMamba comprises three key components: Semantic-aware Local Projection (SLP), Global Representation Learning (GRL), and Semantic Fusion (SeF). Specifically, SLP integrates semantic embeddings to map visual features to local semantic-related representations, while GRL encourages the model to learn global semantic representations. SeF combines these two semantic representations to enhance the discriminability of semantic features. We incorporate these designs into Vision Mamba, forming an end-to-end ZSL framework. As a result, the learned semantic representations are better suited for classification. Through extensive experiments on four prominent ZSL benchmarks, ZeroMamba demonstrates superior performance, significantly outperforming state-of-the-art CNN-based and ViT-based methods under both conventional ZSL (CZSL) and generalized ZSL (GZSL) settings. Code is available at: https://anonymous.4open.science/r/ZeroMamba.
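
The abstract says that SeF combines the local and global semantic representations but does not give the fusion rule. As a rough illustration only, the sketch below assumes the simplest possible rule, a convex combination controlled by a weight `alpha`. The function name `fuse`, the parameter `alpha`, and the example vectors are hypothetical, and the rule actually used in ZeroMamba may differ.

```python
def fuse(local_sem, global_sem, alpha=0.5):
    """Convex combination of two semantic vectors (an assumed fusion rule).

    alpha = 1.0 keeps only the local representation,
    alpha = 0.0 keeps only the global representation.
    """
    if len(local_sem) != len(global_sem):
        raise ValueError("semantic vectors must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * l + (1.0 - alpha) * g for l, g in zip(local_sem, global_sem)]


# Hypothetical outputs of a local branch and a global branch for one image.
local_sem = [0.9, 0.1, 0.4]
global_sem = [0.5, 0.3, 0.8]

fused = fuse(local_sem, global_sem, alpha=0.5)
print(fused)
```

The fused vector lives in the same semantic space as the class attribute vectors, so it can be compared against them directly at classification time.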

Paper Structure

This paper contains 16 sections, 14 equations, 11 figures, and 7 tables.

Figures (11)

  • Figure 1: Compared with state-of-the-art CNN-based and ViT-based methods on CUB, our proposed ZeroMamba achieves the best trade-off between accuracy and number of parameters.
  • Figure 2: Left: The framework of ZeroMamba. Image patches are traversed along four-way scanning and fed into the Vision Mamba Encoder. The major innovation of our work lies in three simple yet effective designs: the SLP module, the GRL module, and the SeF strategy. By incorporating these designs into vanilla VMamba [liu2024vmamba], we form an end-to-end framework capable of robust ZSL. Right: The structural details of the SLP module, the GRL module, and the SeF strategy.
  • Figure 3: Visualization of the activation maps of different visual backbones, including CNN (e.g., ResNet-101 [he2016deep]), ViT (e.g., ViT-Base [dosovitskiy2020image]), and ZeroMamba. Our ZeroMamba can accurately capture the semantic-related information. We use CUB as an example, with the red box indicating challenging cases.
  • Figure 4: (Please zoom in for details.) Visualizations with t-SNE embeddings of different methods produced by (a) DAZLE [huynh2020fine], (b) ZSLViT [chen2024progressive], and (c) ZeroMamba (ours) in both visual and semantic spaces. The 10 colors denote 10 different classes randomly selected from CUB. (Best viewed in shape and color.)
  • Figure 5: Comparison of effective receptive fields (ERF) [luo2016understanding] between CNN (e.g., ResNet-101 [he2016deep]), ViT (e.g., DeiT [touvron2021training]), and ZeroMamba. Pixels with higher intensity (darker color) indicate a stronger response to the central pixel. On CUB, ZeroMamba and ViT exhibit a global receptive field, while the CNN has only a local one.
  • ...and 6 more figures