Manta: Enhancing Mamba for Few-Shot Action Recognition of Long Sub-Sequence

Wenbo Huang; Jinghui Zhang; Guang Li; Lei Zhang; Shuoyuan Wang; Fang Dong; Jiahui Jin; Takahiro Ogawa; Miki Haseyama

TL;DR

Manta addresses few-shot action recognition (FSAR) on long sub-sequences with two parallel branches. The first, Matryoshka Mamba, emphasizes local features and performs implicit temporal alignment across multiple scales. The second is a hybrid supervised-unsupervised contrastive learning branch that curbs intra-class variance. Training combines a prototype-based cross-entropy objective with a contrastive loss, and the method achieves state-of-the-art results on SSv2, Kinetics, UCF101, and HMDB51 with ResNet-50, ViT-B, and VMamba-B backbones. Ablations show the benefits of the local-feature modules, the multi-scale design, and the alignment mechanism, and the method remains strong under long sequences, cross-dataset settings, and frame-level noise. Together, these components enable efficient modeling of long sub-sequences and more reliable clustering and recognition in few-shot regimes.
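To make the phrase "prototype-based cross-entropy objective" concrete, the following is a minimal sketch of a generic prototype classifier in the style of prototypical networks. It is not the paper's implementation: the function names, the use of a class mean as the prototype, and the negative squared Euclidean distance as the logit are all assumptions chosen for illustration.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def prototype_cross_entropy(support, query, query_label):
    """Cross-entropy of one query against class prototypes.

    support: dict mapping class label -> list of support feature vectors.
    Assumption: a prototype is the mean of its class's support features,
    and the logit is the negative squared distance to that prototype.
    Returns (loss, probabilities per class).
    """
    prototypes = {label: mean_vector(feats) for label, feats in support.items()}
    logits = {label: -squared_distance(query, proto)
              for label, proto in prototypes.items()}
    # Numerically stable softmax over the class logits.
    max_logit = max(logits.values())
    exps = {label: math.exp(value - max_logit) for label, value in logits.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}
    return -math.log(probs[query_label]), probs

# Toy 2-way 2-shot episode with hypothetical 2-D features.
support = {
    "jump": [[1.0, 0.0], [0.8, 0.2]],
    "run": [[0.0, 1.0], [0.2, 0.8]],
}
loss, probs = prototype_cross_entropy(support, [0.9, 0.1], "jump")
print(loss, probs)
```

In this toy episode the query sits exactly on the "jump" prototype, so the loss is small; moving the query toward the "run" prototype increases it.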

Abstract

In few-shot action recognition (FSAR), long sub-sequences of video naturally express entire actions more effectively than short ones. However, the high computational complexity of mainstream Transformer-based methods limits their application. The recently proposed Mamba models long sequences efficiently, but applying it directly to FSAR overlooks the importance of local feature modeling and alignment. Moreover, long sub-sequences within the same class accumulate intra-class variance, which adversely impacts FSAR performance. To address these challenges, we propose a Matryoshka MAmba and CoNtrasTive LeArning framework (Manta). First, the Matryoshka Mamba introduces multiple Inner Modules to enhance local feature representation, rather than directly modeling global features. An Outer Module captures temporal dependencies between these local features for implicit temporal alignment. Second, a hybrid contrastive learning paradigm, combining supervised and unsupervised methods, is designed to mitigate the negative effects of intra-class variance accumulation. The Matryoshka Mamba and the hybrid contrastive learning paradigm operate in two parallel branches within Manta, enhancing Mamba for FSAR on long sub-sequences. Manta achieves new state-of-the-art performance on prominent benchmarks, including SSv2, Kinetics, UCF101, and HMDB51. Extensive empirical studies show that Manta significantly improves FSAR on long sub-sequences from multiple perspectives.
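The abstract does not give the loss formulas, so the following is only a rough sketch of how a hybrid of supervised and unsupervised contrastive terms could be combined. It uses a generic SupCon/InfoNCE-style loss on cosine similarities; the function names, the `weight` and `temperature` parameters, and the use of one augmented view per sample are assumptions for illustration, not the authors' formulation.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(anchor, positives, negatives, temperature=0.1):
    """Mean over positives of -log(exp(s_p / t) / sum over all candidates)."""
    pos = [math.exp(cosine_similarity(anchor, p) / temperature) for p in positives]
    neg = [math.exp(cosine_similarity(anchor, n) / temperature) for n in negatives]
    denominator = sum(pos) + sum(neg)
    return -sum(math.log(p / denominator) for p in pos) / len(pos)

def hybrid_contrastive_loss(features, labels, augmented, weight=1.0, temperature=0.1):
    """Supervised term (same-label positives) plus a weighted unsupervised
    term (the anchor's own augmented view as the only positive).
    `weight` is a hypothetical balancing factor."""
    n = len(features)
    supervised, unsupervised, anchors_with_positives = 0.0, 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        same = [features[j] for j in others if labels[j] == labels[i]]
        different = [features[j] for j in others if labels[j] != labels[i]]
        if same:  # skip anchors that have no same-class partner
            supervised += contrastive_loss(features[i], same, different, temperature)
            anchors_with_positives += 1
        unsupervised += contrastive_loss(
            features[i], [augmented[i]], [features[j] for j in others], temperature)
    if anchors_with_positives:
        supervised /= anchors_with_positives
    return supervised + weight * unsupervised / n

# Toy batch: two classes, hypothetical 2-D features and lightly perturbed views.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
augmented = [[0.95, 0.05], [0.85, 0.15], [0.05, 0.95], [0.15, 0.85]]
loss_clustered = hybrid_contrastive_loss(features, [0, 0, 1, 1], augmented)
loss_scattered = hybrid_contrastive_loss(features, [0, 1, 0, 1], augmented)
print(loss_clustered, loss_scattered)
```

When the labels agree with the feature clusters, the supervised term is small; relabeling so that same-class samples lie far apart raises it, which is the kind of intra-class variance such a term penalizes.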
