MA-FSAR: Multimodal Adaptation of CLIP for Few-Shot Action Recognition
Jiazheng Xing, Chao Xu, Mengmeng Wang, Guang Dai, Baigui Sun, Yong Liu, Jingdong Wang, Jian Zhao
TL;DR
MA-FSAR addresses few-shot action recognition (FSAR) by combining parameter-efficient fine-tuning with a fine-grained multimodal design that couples CLIP's visual encoder with text-guided supervision. The method introduces Global Temporal Adaptation, Local Spatiotemporal Adaptation, and Joint Adaptation, all implemented as lightweight adapters, plus a Text-guided Prototype Construction Module that refines class prototypes using support-set text. Empirical results across spatial and temporal datasets show substantial improvements over full fine-tuning and prior CLIP-based methods, with strong gains in 1-shot tasks, while remaining efficient in memory and training time. The work demonstrates the value of multimodal token-level adaptation and prototype-level text guidance for robust, data-efficient video understanding, and points to future directions involving knowledge integration from large language models.
Abstract
Applying large-scale vision-language pre-trained models like CLIP to few-shot action recognition (FSAR) can significantly enhance both performance and efficiency. While several studies have recognized this advantage, most of them resort to full-parameter fine-tuning to adapt CLIP's visual encoder to the FSAR data, which is not only computationally expensive but also overlooks the potential of the visual encoder to engage in temporal modeling and focus on targeted semantics directly. To tackle these issues, we introduce MA-FSAR, a framework that employs the Parameter-Efficient Fine-Tuning (PEFT) technique to enhance the CLIP visual encoder in terms of action-related temporal and semantic representations. Our solution involves a Fine-grained Multimodal Adaptation, which differs from previous attempts at PEFT in regular action recognition. Specifically, we first insert a Global Temporal Adaptation that receives only the class token to capture global motion cues efficiently. These outputs are then integrated with the visual tokens by a Local Multimodal Adaptation to enhance local temporal dynamics; this module incorporates text features unique to the FSAR support-set branch to highlight fine-grained semantics related to actions. In addition to these token-level designs, we propose a prototype-level text-guided construction module to further enrich the temporal and semantic characteristics of video prototypes. Extensive experiments demonstrate superior performance across various tasks with only a small number of trainable parameters.
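To make the notion of a video prototype concrete, the following is a minimal sketch of generic prototype-based few-shot classification, not the paper's text-guided construction module. It assumes each video is already encoded as a list of per-frame feature vectors, that frames are simply averaged, and that a query is assigned to the class whose prototype has the highest cosine similarity. All function names and the toy 2-D features are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def video_feature(frame_features):
    """Assumption: a video is summarized by averaging its frame features."""
    return mean_vector(frame_features)


def build_prototypes(support):
    """support maps a class label to a list of videos (lists of frame features)."""
    return {
        label: mean_vector([video_feature(video) for video in videos])
        for label, videos in support.items()
    }


def classify(query_video, prototypes):
    """Return the label of the prototype most similar to the query video."""
    query = video_feature(query_video)
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))


# Toy 2-way 1-shot episode with made-up 2-D frame features.
support = {
    "jump": [[[1.0, 0.0], [0.9, 0.1]]],
    "run": [[[0.0, 1.0], [0.1, 0.9]]],
}
prototypes = build_prototypes(support)
prediction = classify([[0.8, 0.2], [0.7, 0.3]], prototypes)
```

In this toy episode the query lies closer to the "jump" prototype. In the 1-shot setting each prototype rests on a single video, which is one reason additional cues such as support-set text can be useful for refining it.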
