Manta: Enhancing Mamba for Few-Shot Action Recognition of Long Sub-Sequence

Wenbo Huang, Jinghui Zhang, Guang Li, Lei Zhang, Shuoyuan Wang, Fang Dong, Jiahui Jin, Takahiro Ogawa, Miki Haseyama

TL;DR

Manta tackles few-shot action recognition (FSAR) on long sub-sequences by pairing Matryoshka Mamba, which emphasizes local features and performs implicit temporal alignment across multiple scales, with a hybrid supervised-unsupervised contrastive learning branch that curbs intra-class variance. The method combines a prototype-based cross-entropy objective with a hybrid contrastive loss and reports state-of-the-art results on SSv2, Kinetics, UCF101, and HMDB51 across ResNet-50, ViT-B, and VMamba-B backbones. Ablations show the benefits of the local-feature modules, the multi-scale design, and the alignment mechanism, as well as strong performance under long sequences, cross-dataset settings, and frame-level noise. The work improves FSAR by enabling efficient modeling of long sub-sequences and more reliable clustering and recognition in few-shot regimes.

Abstract

In few-shot action recognition (FSAR), long sub-sequences of video express entire actions more effectively than short ones. However, the high computational complexity of mainstream Transformer-based methods limits their application to such sequences. The recent Mamba architecture models long sequences efficiently, but applying it directly to FSAR overlooks the importance of local feature modeling and alignment. Moreover, long sub-sequences within the same class accumulate intra-class variance, which adversely impacts FSAR performance. To address these challenges, we propose a Matryoshka MAmba and CoNtrasTive LeArning framework (Manta). First, the Matryoshka Mamba introduces multiple Inner Modules to enhance local feature representation, rather than directly modeling global features. An Outer Module captures temporal dependencies between these local features for implicit temporal alignment. Second, a hybrid contrastive learning paradigm, combining supervised and unsupervised methods, is designed to mitigate the negative effects of intra-class variance accumulation. The Matryoshka Mamba and the hybrid contrastive learning paradigm operate in two parallel branches within Manta, enhancing Mamba for FSAR on long sub-sequences. Manta achieves new state-of-the-art performance on prominent benchmarks, including SSv2, Kinetics, UCF101, and HMDB51. Extensive empirical studies show that Manta significantly improves FSAR on long sub-sequences from multiple perspectives.
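The two-branch objective summarized above (a prototype-based cross-entropy term plus a contrastive term) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the squared Euclidean distance, the temperature `tau`, the weight `lam`, and the use of a purely supervised contrastive term as a stand-in for the hybrid supervised-unsupervised loss are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Average a list of equal-length vectors (the class prototype)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(u: Vector, v: Vector) -> float:
    """Squared Euclidean distance (an assumed choice of metric)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def prototype_ce(query: Vector, label: int,
                 support: Dict[int, List[Vector]]) -> float:
    """Cross-entropy over negative query-to-prototype distances."""
    classes = sorted(support)
    protos = {c: mean_vector(support[c]) for c in classes}
    logits = [-sq_dist(query, protos[c]) for c in classes]
    m = max(logits)  # subtract the max for a stable log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[classes.index(label)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def sup_contrastive(features: List[Vector], labels: List[int],
                    tau: float = 0.1) -> float:
    """Supervised contrastive loss, averaged over anchors with a positive."""
    losses = []
    for i, f_i in enumerate(features):
        others = [j for j in range(len(features)) if j != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue
        sims = {j: cosine(f_i, features[j]) / tau for j in others}
        m = max(sims.values())
        log_z = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        losses.append(-sum(sims[j] - log_z for j in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


def total_loss(ce: float, hc: float, lam: float = 1.0) -> float:
    """Weighted sum of the two branch losses; `lam` is a hypothetical weight."""
    return ce + lam * hc
```

In this sketch, features of the same class that sit close together yield a small contrastive term, which is the sense in which such a branch can reduce intra-class variance.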

Paper Structure

This paper contains 56 sections, 16 equations, 10 figures, 17 tables, 1 algorithm.

Figures (10)

  • Figure 1: In two long sub-sequence examples of “Diving cliff”, significant local features (highlighted as “Falling”) occupy only small portions of the examples and are located at different points in the timeline. Additionally, the frame pairs from these examples exhibit large discrepancies in visual features. As the number of frames increases, intra-class variance gradually accumulates.
  • Figure 2: The overall architecture of the Matryoshka Mamba and Contrastive Learning framework (Manta), which has four parts. ① Feature Extraction: a backbone extracts features from the query and support. ② Mamba Branch: Matryoshka Mamba emphasizes local features and performs temporal alignment. ③ Contrastive Branch: hybrid contrastive learning alleviates the accumulation of intra-class variance. ④ Training Objective: $\mathcal{L}_{\text{total}}$ combines the cross-entropy loss $\mathcal{L}_{\text{ce}}$ from the ② Mamba Branch with the contrastive loss $\mathcal{L}_{\text{hc}}$ from the ③ Contrastive Branch. The symbol Ⓐ denotes averaging.
  • Figure 3: The structure of Matryoshka Mamba. The symbols +, $\times$, and Ⓒ denote element-wise addition, multiplication, and concatenation, respectively. The Conv2D Block has three 2D convolutions and a batch normalization layer. Red indicates local features, while the feature itself is shown as a dotted line.
  • Figure 4: The structure of the Inner Module based on Mamba-2, where Ⓝ, Fw, Bw, AF, and SSM denote normalization, forward, backward, activation function, and state space model, respectively.
  • Figure 5: The bidirectional structure of the Outer Module based on Mamba-2, which first decomposes the input. The two sub-branches share parameters.
  • ...and 5 more figures
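
The Figure 5 caption describes a bidirectional scan whose two sub-branches share parameters. A minimal Python sketch of that idea follows, using a toy scalar state space recurrence as a stand-in for Mamba-2. The function names, the scalar parameters `a`, `b`, and `c`, and the choice to sum the two directions are illustrative assumptions, not details taken from the paper.

```python
from typing import List


def ssm_scan(x: List[float], a: float, b: float, c: float) -> List[float]:
    """Toy scalar state space recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h = 0.0
    y = []
    for x_t in x:
        h = a * h + b * x_t
        y.append(c * h)
    return y


def bidirectional_scan(x: List[float], a: float, b: float, c: float) -> List[float]:
    """Run the same recurrence forward and backward, then sum the outputs.

    Both directions reuse (a, b, c), mirroring the shared-parameter
    sub-branches; summation is an assumed way of merging them.
    """
    forward = ssm_scan(x, a, b, c)
    # Scan the reversed sequence, then restore the original time order.
    backward = ssm_scan(x[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(forward, backward)]


# An impulse at t=0 decays forward in time; the backward pass only sees it at t=0.
print(bidirectional_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0))  # [2.0, 0.5, 0.25]
```

Because the forward and backward passes use the same parameters, a time-symmetric input produces a time-symmetric output, so each position receives context from both earlier and later frames.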