Multimodal Prototype-Enhanced Network for Few-Shot Action Recognition

Xinzhe Ni, Yong Liu, Hao Wen, Yatai Ji, Jing Xiao, Yujiu Yang

TL;DR

The paper tackles few-shot action recognition by enriching prototype-based matching with multimodal information from label texts. It introduces MORN, which combines CLIP visual and text encoders, a semantic-enhanced (SE) text stream, and a multimodal prototype-enhanced (MPE) module to form multimodal prototypes, and it defines the PRIDE metric to quantify prototype quality. MORN achieves state-of-the-art results on HMDB51, UCF101, Kinetics, and SSv2, and incorporating PRIDE into training yields additional gains. The work argues that higher-quality multimodal prototypes improve discriminability in data-scarce regimes.

Abstract

Current methods for few-shot action recognition mainly fall into the metric learning framework following ProtoNet, which demonstrates the importance of prototypes. Although they achieve relatively good performance, they ignore the effect of multimodal information such as label texts. In this work, we propose a novel MultimOdal PRototype-ENhanced Network (MORN), which uses the semantic information of label texts as multimodal information to enhance prototypes. A CLIP visual encoder and a frozen CLIP text encoder are introduced to obtain features with good multimodal initialization. Then, in the visual flow, visual prototypes are computed by a visual prototype-computed module. In the text flow, a semantic-enhanced (SE) module and an inflating operation are used to obtain text prototypes. The final multimodal prototypes are then computed by a multimodal prototype-enhanced (MPE) module. In addition, we define a PRototype SImilarity DiffErence (PRIDE) to evaluate the quality of prototypes, which is used to verify our improvement at the prototype level and the effectiveness of MORN. We conduct extensive experiments on four popular few-shot action recognition datasets (HMDB51, UCF101, Kinetics, and SSv2), and MORN achieves state-of-the-art results. Plugging PRIDE into the training stage further improves performance.
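
To make the prototype idea concrete, here is a minimal sketch of ProtoNet-style matching with a text prototype mixed in. It is an illustration under stated assumptions, not the paper's method: the `fuse` function is a hypothetical convex combination standing in for the MPE module, the feature vectors are toy 2-D values rather than CLIP features, and `alpha` is an invented parameter.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equal-length feature vectors (a visual prototype)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def fuse(visual_proto, text_proto, alpha=0.5):
    """Hypothetical fusion: convex combination of visual and text prototypes.
    This is an assumption for illustration, not the paper's MPE module."""
    return [alpha * v + (1 - alpha) * t for v, t in zip(visual_proto, text_proto)]

def classify(query, prototypes):
    """Return the class whose prototype is most similar to the query feature."""
    return max(prototypes, key=lambda c: cosine(query, prototypes[c]))

# Toy 2-way 2-shot episode with made-up 2-D features.
support = {
    "run": [[1.0, 0.2], [0.8, 0.0]],
    "jump": [[0.1, 1.0], [0.3, 0.8]],
}
text_feats = {"run": [1.0, 0.0], "jump": [0.0, 1.0]}  # stand-ins for label-text features

prototypes = {c: fuse(mean_vector(v), text_feats[c]) for c, v in support.items()}
prediction = classify([0.9, 0.3], prototypes)
print(prediction)
```

With only one or two support videos per class, the visual mean can be noisy; mixing in a label-derived vector is one simple way to see why a text prototype might stabilize it.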

Paper Structure

This paper contains 13 sections, 14 equations, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: Existing metric-learning strategies (a) and (b) and our multimodal prototype-enhanced strategy (c) for few-shot action recognition. Results on TRX (Perrett et al., 2021) with a CLIP visual encoder, the multimodal feature-enhanced strategy, and the multimodal prototype-enhanced strategy are shown.
  • Figure 2: Overview of our proposed MORN built on TRX for a 2-way 1-shot problem with 1 video for each category in the query set. In the visual flow, a CLIP visual encoder is first applied to videos with $L$ frames to obtain video features. Then, support video features are passed to the Temporal-Relational CrossTransformer (TRX) module to compute visual prototypes. In the text flow, a frozen CLIP text encoder is first applied to the prompted label texts. Then, the semantic features of label texts are passed to a semantic-enhanced (SE) module and are inflated as text prototypes. The visual prototypes and the text prototypes are combined through a multimodal prototype-enhanced (MPE) module.
  • Figure 3: Performance gains of PRIDE and accuracy on HMDB51. MORN achieves PRIDE gains in (a) and accuracy gains in (b).
  • Figure 4: t-SNE (van der Maaten and Hinton, 2008) projection of prototypes of each episode and real prototypes on HMDB51.
  • Figure 5: Overview of our proposed MORN with PRIDE loss on a 2-way 1-shot problem with 2 videos for each category in the query set. CE loss and PRIDE loss are combined through a learnable weight.
  • ...and 1 more figure
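
The captions above refer to PRIDE and to "real prototypes", but this section does not give the formula. The sketch below shows one plausible reading of the name "PRototype SImilarity DiffErence", offered purely as an assumption: the score is a prototype's similarity to a reference prototype of its own class minus its mean similarity to the reference prototypes of the other classes. The function name `similarity_difference`, the use of cosine similarity, and the toy vectors are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_difference(proto, own_reference, other_references):
    """Assumed score: similarity to the own-class reference prototype minus
    the mean similarity to the other classes' reference prototypes.
    Not taken from the paper; a guess based on the metric's name."""
    own = cosine(proto, own_reference)
    others = sum(cosine(proto, r) for r in other_references) / len(other_references)
    return own - others

# Toy references for a 2-class setting (made-up values).
own_ref = [1.0, 0.0]
other_refs = [[0.0, 1.0]]

# A prototype close to its own class reference scores higher than an ambiguous one.
good = similarity_difference([1.0, 0.1], own_ref, other_refs)
poor = similarity_difference([1.0, 1.0], own_ref, other_refs)
print(good > poor)
```

Under this reading, a higher score means a prototype sits closer to its own class and farther from the others, which matches the role the abstract gives PRIDE as a measure of prototype quality.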