Two-stream joint matching method based on contrastive learning for few-shot action recognition
Long Deng, Ziqiang Li, Bingxin Zhou, Zhongming Chen, Ao Li, Yongxin Ge
TL;DR
TSJM targets two weaknesses of metric-learning-based few-shot action recognition: underused multi-modal information, and video matching under length/speed variation and sub-action misalignment. It jointly leverages RGB appearance and optical-flow motion through a Multi-modal Contrastive Learning Module (MCL) and a Joint Matching Module (JMM). An adapter aligns cross-modal features, and InfoNCE-based contrastive learning maximizes inter-modal mutual information. The JMM combines ordered temporal alignment with a weighted bipartite graph matching strategy to handle length/speed variability and sub-action misalignment, yielding four modality-specific similarity scores that are fused for the final prediction. Experiments on SSv2 and Kinetics with 5-way N-shot tasks show competitive performance, and ablations show gains from the MCL, the adapter, and the JMM. The approach provides a practical multi-modal framework for robust few-shot action recognition, with potential for efficient deployment across video domains.
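To make the InfoNCE objective mentioned above concrete, the following is a minimal sketch of a contrastive loss between paired RGB and optical-flow embeddings. It is not the authors' implementation: the function names (`info_nce`, `l2_normalize`), the symmetric two-direction form, the cosine similarity, and the temperature value of 0.07 are all assumptions chosen for illustration, and the code uses plain Python lists rather than a deep learning framework so that it runs without dependencies.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(rgb_feats, flow_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    rgb_feats[i] and flow_feats[i] are assumed to come from the same video
    (the positive pair); every other pairing in the batch is a negative.
    The temperature and the symmetric form are illustrative assumptions.
    """
    rgb = [l2_normalize(v) for v in rgb_feats]
    flow = [l2_normalize(v) for v in flow_feats]

    # logits[i][j]: temperature-scaled cosine similarity of RGB i and flow j.
    logits = [
        [sum(a * b for a, b in zip(r, f)) / temperature for f in flow]
        for r in rgb
    ]

    def cross_entropy_on_diagonal(rows):
        # Mean of -log softmax(row)[i], using a stable log-sum-exp.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Transpose to obtain the flow-to-RGB direction.
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (
        cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits_t)
    )


# Toy batch: three videos with perfectly aligned vs. shuffled modalities.
rgb_batch = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_flow = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled_flow = [aligned_flow[1], aligned_flow[2], aligned_flow[0]]

loss_aligned = info_nce(rgb_batch, aligned_flow)
loss_shuffled = info_nce(rgb_batch, shuffled_flow)
```

In this toy batch the loss is close to zero when each RGB embedding agrees with the flow embedding of the same video and large when the pairs are shuffled, which is the behavior a mutual-information-maximizing objective is expected to reward.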
Abstract
Although few-shot action recognition based on the metric learning paradigm has achieved significant success, it still leaves two issues unresolved: (1) inadequate modeling of action relations and underutilization of multi-modal information; (2) difficulty matching videos that differ in length and speed, and videos whose sub-actions are misaligned. To address these issues, we propose a Two-Stream Joint Matching method based on contrastive learning (TSJM), which consists of two modules: a Multi-modal Contrastive Learning Module (MCL) and a Joint Matching Module (JMM). The MCL explores the mutual information between modalities, extracting modal information more thoroughly to improve the modeling of action relations. The JMM addresses both of the aforementioned video matching problems simultaneously. We evaluate the proposed method on two widely used few-shot action recognition datasets, SSv2 and Kinetics, and conduct comprehensive ablation experiments to substantiate its efficacy.
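The two matching problems named in the abstract pull in different directions, which a small sketch can illustrate. The code below is not the paper's JMM. It assumes a DTW-style monotonic alignment as a stand-in for ordered temporal alignment and a brute-force minimum-cost assignment as a stand-in for weighted bipartite matching. The function names, the one-hot frame features, and the weight `alpha` are all hypothetical.

```python
from itertools import permutations


def cost_matrix(query, support):
    """cost[i][j] = 1 - dot(query[i], support[j]) for unit-length features."""
    return [
        [1.0 - sum(a * b for a, b in zip(q, s)) for s in support]
        for q in query
    ]


def ordered_alignment_cost(cost):
    """Minimum cumulative cost over monotonic (DTW-style) frame alignments.

    Tolerates differences in length and speed, but penalizes videos whose
    sub-actions occur in a different order.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1]
            )
    return acc[n][m]


def bipartite_matching_cost(cost):
    """Minimum-cost one-to-one frame assignment, ignoring temporal order.

    Brute force over permutations, so only suitable for a handful of frames;
    requires a square cost matrix.
    """
    n = len(cost)
    return min(
        sum(cost[i][p[i]] for i in range(n)) for p in permutations(range(n))
    )


def joint_cost(cost, alpha=0.5):
    """Hypothetical fusion: a weighted sum of the two matching costs."""
    return alpha * ordered_alignment_cost(cost) + (
        1.0 - alpha
    ) * bipartite_matching_cost(cost)


# Three sub-actions A, B, C encoded as one-hot frame features.
A, B, C = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
query = [A, B, C]
same_order = cost_matrix(query, [A, B, C])
swapped = cost_matrix(query, [B, A, C])  # first two sub-actions misaligned
slower = cost_matrix(query, [A, A, B, C])  # same action at a different speed

ordered_same = ordered_alignment_cost(same_order)
ordered_swapped = ordered_alignment_cost(swapped)
ordered_slower = ordered_alignment_cost(slower)
bipartite_swapped = bipartite_matching_cost(swapped)
```

In this toy case the ordered alignment absorbs the speed change at no cost but penalizes the swapped sub-actions, while the order-free assignment matches the swapped video perfectly. That complementarity is one plausible reading of why the two strategies are combined.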
