MA-FSAR: Multimodal Adaptation of CLIP for Few-Shot Action Recognition

Jiazheng Xing, Chao Xu, Mengmeng Wang, Guang Dai, Baigui Sun, Yong Liu, Jingdong Wang, Jian Zhao

TL;DR

MA-FSAR addresses few-shot action recognition (FSAR) by integrating parameter-efficient fine-tuning with a fine-grained multimodal design that couples CLIP's visual encoder with text-guided supervision. The method introduces Global Temporal Adaptation, Local Spatiotemporal Adaptation, and a Joint Adaptation, all implemented as lightweight adapters, plus a Text-guided Prototype Construction Module that refines class prototypes using support-set text. Empirical results across spatial and temporal datasets show substantial improvements over full fine-tuning and prior CLIP-based methods, with strong gains in 1-shot tasks, while remaining efficient in memory and training time. The work demonstrates the value of multimodal token-level adaptation and prototype-level text guidance for robust, data-efficient video understanding, and points to future directions involving knowledge integration from large language models.

Abstract

Applying large-scale vision-language pre-trained models like CLIP to few-shot action recognition (FSAR) can significantly enhance both performance and efficiency. While several studies have recognized this advantage, most of them resort to full-parameter fine-tuning to adapt CLIP's visual encoder to the FSAR data, which is not only computationally expensive but also overlooks the potential of the visual encoder to engage in temporal modeling and focus on targeted semantics directly. To tackle these issues, we introduce MA-FSAR, a framework that employs the Parameter-Efficient Fine-Tuning (PEFT) technique to enhance the CLIP visual encoder in terms of action-related temporal and semantic representations. Our solution involves a Fine-grained Multimodal Adaptation, which differs from previous attempts to apply PEFT to regular action recognition. Specifically, we first insert a Global Temporal Adaptation that receives only the class token to capture global motion cues efficiently. These outputs are then integrated with visual tokens to enhance local temporal dynamics through a Local Multimodal Adaptation, which incorporates text features unique to the FSAR support set branch to highlight fine-grained semantics related to actions. In addition to these token-level designs, we propose a prototype-level text-guided construction module to further enrich the temporal and semantic characteristics of video prototypes. Extensive experiments demonstrate superior performance on various tasks with few trainable parameters.
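
The PEFT idea in the abstract relies on small trainable modules added to a frozen backbone. The following is a minimal sketch of one such module, assuming the common bottleneck adapter form (down-projection, nonlinearity, up-projection, residual connection). It is not the authors' implementation: the function names, the GELU choice, and the list-based weights are illustrative assumptions, and the cross-frame attention of the Global Temporal Adaptation is not modeled here.

```python
import math


def gelu(x):
    # Exact GELU via the Gaussian error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    # weight is a list of rows with shape (out_dim, in_dim).
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def adapter(x, w_down, b_down, w_up, b_up):
    # Hypothetical bottleneck adapter applied to one token vector x:
    # project down to a small width, apply GELU, project back up,
    # then add the result to the input (residual connection).
    hidden = [gelu(h) for h in linear(x, w_down, b_down)]
    delta = linear(hidden, w_up, b_up)
    return [xi + di for xi, di in zip(x, delta)]


def adapter_param_count(dim, bottleneck):
    # Two weight matrices plus two bias vectors.
    return 2 * dim * bottleneck + dim + bottleneck
```

In this sketch, a zero-initialized up-projection makes the adapter an identity map, so the frozen encoder's behavior is unchanged at the start of training and only the adapter weights need to be updated.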

Paper Structure (37 sections, 17 equations, 6 figures, 15 tables)

This paper contains 37 sections, 17 equations, 6 figures, 15 tables.

Figures (6)

  • Figure 1: (a): (i) AIM [yang2023aim], a method that successfully applies PEFT to action recognition; (ii) the support branch of CLIP-FSAR [wang2023clip], a representative method that fully fine-tunes CLIP for few-shot action recognition; and (iii) the pipeline of our proposed method's support branch. (b): Visualization of the attention maps at the visual encoder's last layer for the proposed MA-FSAR and AIM [yang2023aim]. AIM treats action recognition as a classification task, whereas few-shot action recognition is a matching task. Therefore, for a fair comparison, both methods use the same few-shot temporal alignment metric, OTAM [cao2020few]. In this comparison, the attention maps from our method focus more on action-related objects because text tokens and visual tokens are integrated in the visual encoder. (c): Performance comparison of different few-shot action recognition methods on the SSv2-Small 5-way 1-shot task, including our MA-FSAR, OTAM [cao2020few], TRX [perrett2021temporal], STRM [thatipelli2022spatio], HyRSM [wang2022hybrid], MoLo [wang2023molo], and CLIP-FSAR [wang2023clip]. Bubble or star size indicates recognition accuracy. Our MA-FSAR achieves the highest recognition accuracy with the fewest trainable parameters.
  • Figure 2: Overview of MA-FSAR. For simplicity, we focus on a specific scenario: the 5-way 1-shot task with a query set $\mathcal{Q}$ containing a single video. The support set video features $\textbf{F}_{\mathcal{S}}$ and the query video feature $\textbf{F}_{\mathcal{Q}}$ are obtained by the visual encoder with the Fine-grained Multimodal Adaptation (FgMA). Text features $\textbf{F}_{\mathcal{T}}$ are obtained through a text encoder. The Text-guided Prototype Construction Module (TPCM) generates the final features before prototype matching, denoted as $\widetilde{\textbf{F}_\mathcal{S}}$ and $\widetilde{\textbf{F}_\mathcal{Q}}$. The probability distribution $\textbf{p}_{\mathcal{Q}2\mathcal{T}}$ is obtained using the cosine similarity metric, and $\textbf{p}_{\mathcal{Q}2\mathcal{S}}$ is calculated using the prototype matching metric. The loss $\mathcal{L}_{\mathcal{Q}2\mathcal{S}}$ is the standard Cross-Entropy loss, while $\mathcal{L}_{\mathcal{S}2\mathcal{T}}$ and $\mathcal{L}_{\mathcal{Q}2\mathcal{T}}$ are Kullback-Leibler divergence (KL) losses.
  • Figure 3: (a) shows the structure of the Adapter [houlsby2019parameter], and (b) shows the structure of a standard ViT [dosovitskiy2020image] block. (c) and (d) illustrate the fine-grained multimodal adaptation of each ViT block for the support and query set branches. Note that GT-MSA, LM-MSA, and LST-MSA share weights but are applied to different inputs with different motivations: global temporal, local multimodal, and local spatiotemporal modeling, respectively.
  • Figure 4: (a) and (b) show the structure of the TPCM module for the support set and query set branches, respectively. $\oplus$ denotes element-wise summation.
  • Figure 5: The inference process of MA-FSAR.
  • ...and 1 more figure
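
The Figure 2 caption describes a query-to-text distribution obtained from cosine similarity, KL divergence losses, and a Cross-Entropy loss. The sketch below illustrates those ingredients under stated assumptions: turning similarities into probabilities with a softmax, the temperature value, the direction of the KL divergence, and all function names are illustrative choices that the caption does not specify.

```python
import math


def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    # Numerically stable softmax.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_to_text_probs(query_feat, text_feats, temperature=0.07):
    # Assumed form of p_{Q2T}: softmax over temperature-scaled
    # cosine similarities between the query and each class text feature.
    sims = [cosine(query_feat, t) / temperature for t in text_feats]
    return softmax(sims)


def kl_divergence(target, pred, eps=1e-12):
    # KL(target || pred); the direction is an assumption of this sketch.
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, pred) if t > 0)


def cross_entropy(pred, label, eps=1e-12):
    # Standard Cross-Entropy for a single example with an integer label.
    return -math.log(pred[label] + eps)
```

With a one-hot target, `kl_divergence` and `cross_entropy` give the same value up to the `eps` smoothing, which is one way to sanity-check the two loss terms against each other.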