Two-stream joint matching method based on contrastive learning for few-shot action recognition

Long Deng; Ziqiang Li; Bingxin Zhou; Zhongming Chen; Ao Li; Yongxin Ge

TL;DR

TSJM tackles core challenges in few-shot action recognition by jointly leveraging RGB appearance and optical-flow motion through a Multi-modal Contrastive Learning Module (MCL) and a Joint Matching Module (JMM) that combines ordered temporal alignment with a weighted bipartite graph matching strategy. An adapter aligns cross-modal features, InfoNCE-based contrastive learning maximizes inter-modal mutual information, and the JMM addresses length/speed variability and sub-action misalignment, yielding four modality-specific similarity scores fused for final prediction. Experiments on SSv2 and Kinetics with 5-way N-shot tasks demonstrate competitive performance and clear ablation gains from MCL, the adapter, and JMM. The approach provides a practical, multi-modal framework for robust few-shot action recognition with potential for efficient deployment across video domains.
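As a rough illustration of the InfoNCE objective mentioned above, the sketch below computes a contrastive loss between RGB and optical-flow feature vectors. It is a minimal sketch and not the authors' MCL: the function name `info_nce`, the temperature of 0.1, the RGB-to-flow direction, and the assumption that features at the same index come from the same video are all illustrative choices that do not appear in the source.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def info_nce(rgb_feats, flow_feats, temperature=0.1):
    """Mean InfoNCE loss with RGB features as anchors and flow features as candidates.

    Assumption (hypothetical setup): rgb_feats[i] and flow_feats[i] come from
    the same video and form the positive pair; every other flow feature in the
    batch acts as a negative.
    """
    rgb = [l2_normalize(v) for v in rgb_feats]
    flow = [l2_normalize(v) for v in flow_feats]
    total = 0.0
    for i, anchor in enumerate(rgb):
        # Cosine similarity to every candidate, sharpened by the temperature.
        logits = [dot(anchor, cand) / temperature for cand in flow]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
        # Negative log-probability assigned to the positive pair.
        total += log_denominator - logits[i]
    return total / len(rgb)
```

Minimizing this quantity pulls each video's RGB and flow features together while pushing apart features from different videos, which is the usual way an InfoNCE loss serves as a lower-bound proxy for mutual information between two views.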

Abstract

Although few-shot action recognition based on the metric learning paradigm has achieved significant success, it fails to address the following issues: (1) inadequate action relation modeling and underutilization of multi-modal information; (2) difficulty matching videos of different lengths and speeds, and matching videos whose sub-actions are misaligned. To address these issues, we propose a Two-Stream Joint Matching method based on contrastive learning (TSJM), which consists of two modules: a Multi-modal Contrastive Learning Module (MCL) and a Joint Matching Module (JMM). The MCL explores the mutual information between modalities, extracting modal information more thoroughly to improve the modeling of action relationships. The JMM addresses both of the aforementioned video matching problems simultaneously. The effectiveness of the proposed method is evaluated on two widely used few-shot action recognition datasets, SSv2 and Kinetics. Comprehensive ablation experiments are also conducted to substantiate the efficacy of the approach.
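To make the two matching problems concrete, the sketch below scores a frame-level similarity matrix in two ways: an order-preserving alignment, which tolerates speed and length differences, and an order-free one-to-one assignment, which tolerates misaligned sub-actions. It is a toy illustration under stated assumptions and not the authors' JMM: the function names, the brute-force assignment (practical only for a handful of frames), and the equal-weight fusion via `alpha` are all hypothetical.

```python
from itertools import permutations


def ordered_alignment_score(sim):
    """Best mean similarity when query frames map to support frames in order.

    sim[i][j] is the similarity of query frame i to support frame j. Each
    query frame picks one support frame, and the chosen support indices must
    be non-decreasing, so temporal order is preserved.
    """
    n, m = len(sim), len(sim[0])
    prev = list(sim[0])
    for i in range(1, n):
        running = float("-inf")
        cur = []
        for j in range(m):
            # Best score of the previous row over support indices <= j.
            running = max(running, prev[j])
            cur.append(sim[i][j] + running)
        prev = cur
    return max(prev) / n


def bipartite_score(sim):
    """Best mean similarity over one-to-one frame assignments, ignoring order.

    Brute force over assignments; assumes len(sim) <= len(sim[0]) and only a
    few frames. A real system would use an assignment solver instead.
    """
    n, m = len(sim), len(sim[0])
    best = max(
        sum(sim[i][p[i]] for i in range(n))
        for p in permutations(range(m), n)
    )
    return best / n


def joint_score(sim, alpha=0.5):
    """Hypothetical fusion: weighted average of the two matching scores."""
    return alpha * ordered_alignment_score(sim) + (1 - alpha) * bipartite_score(sim)
```

For a support video whose sub-actions appear in reverse order, the ordered score drops while the assignment score stays high, so a fused score retains evidence that either matcher alone would miss.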

Paper Structure (12 sections, 8 equations, 3 figures, 2 tables)

Introduction
Related Work
Method
Problem Formulation
Overview
MCL
JMM
Experiments
Datasets
Implementation Details
Comparison with State-of-the-Art Methods
Ablation Study

Figures (3)

Figure 1: Within the same category, “Spreading something onto something”, the fundamental action remains consistent, but the agents executing it are entirely different (toy building blocks and a clamp). Although the actions are fundamentally similar, their sub-actions are visibly misaligned, and the direction of motion is reversed.
Figure 2: Illustration of the comprehensive framework of our method, taking the 3-way 1-shot task as an example.
Figure 3: N-way 1-shot results on Kinetics and SSv2.
