M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition

Mengmeng Wang, Jiazheng Xing, Boyuan Jiang, Jun Chen, Jianbiao Mei, Xingxing Zuo, Guang Dai, Jingdong Wang, Yong Liu

TL;DR

This work tackles the challenge of transferring large vision-language models like CLIP to video action recognition without sacrificing cross-modal generalization. It introduces M$^2$-CLIP, which freezes CLIP backbones and augments them with TED-Adapter and Text-Adapter for robust temporal and semantic representation, complemented by a four-head multi-task decoder (contrastive, cross-modal classification, cross-modal masked language modeling, and visual classification). The approach achieves strong supervised performance with a small fraction of trainable parameters and delivers state-of-the-art zero-shot transfer on multiple benchmarks, outperforming several unimodal and multimodal PEFT methods. Practically, this framework provides a scalable path to deploy powerful CLIP-based video understanding with efficient fine-tuning and robust generalization.

Abstract

Recently, the rise of large-scale vision-language pretrained models like CLIP, coupled with Parameter-Efficient Fine-Tuning (PEFT), has attracted substantial attention in video action recognition. Nevertheless, prevailing approaches tend to prioritize strong supervised performance at the expense of the models' generalization capabilities during transfer. In this paper, we introduce a novel Multimodal, Multi-task CLIP adapting framework named M$^2$-CLIP to address these challenges, preserving both high supervised performance and robust transferability. First, to enhance the individual modality architectures, we introduce multimodal adapters to both the visual and text branches. Specifically, we design a novel visual TED-Adapter that performs global Temporal Enhancement and local temporal Difference modeling to improve the temporal representation capabilities of the visual encoder. Moreover, we adopt text encoder adapters to strengthen the learning of semantic label information. Second, we design a multi-task decoder with a rich set of supervisory signals to satisfy the need for both strong supervised performance and generalization within a multimodal framework. Experimental results validate the efficacy of our approach, demonstrating exceptional performance in supervised learning while maintaining strong generalization in zero-shot scenarios.
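
To make the "local temporal Difference" idea in the abstract more concrete, here is a minimal sketch. It is not the authors' TED-Adapter. It assumes one feature vector per frame, takes forward differences between adjacent frames, and adds them back as a scaled residual. The function names, the zero padding for the last frame, and the scale `alpha` are all illustrative assumptions.

```python
from typing import List

Frame = List[float]  # one feature vector per frame (hypothetical layout)


def temporal_difference(frames: List[Frame]) -> List[Frame]:
    """Forward difference x[t+1] - x[t]; the last frame gets zeros (assumed padding)."""
    if not frames:
        return []
    diffs = [
        [b - a for a, b in zip(cur, nxt)]
        for cur, nxt in zip(frames, frames[1:])
    ]
    diffs.append([0.0] * len(frames[-1]))
    return diffs


def add_difference_residual(frames: List[Frame], alpha: float = 0.5) -> List[Frame]:
    """Add the scaled temporal difference back onto each frame feature.

    `alpha` is a hypothetical scale, not a value from the paper.
    """
    diffs = temporal_difference(frames)
    return [
        [x + alpha * d for x, d in zip(frame, diff)]
        for frame, diff in zip(frames, diffs)
    ]


if __name__ == "__main__":
    clip = [[1.0, 2.0], [2.0, 4.0], [4.0, 4.0]]  # 3 frames, 2-dim features
    print(temporal_difference(clip))
    print(add_difference_residual(clip))
```

In a real adapter the difference would typically pass through a small trainable projection before being added back, while the backbone weights stay frozen. The sketch omits that step to keep the arithmetic visible.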

Paper Structure (14 sections, 12 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Performance comparison: zero-shot vs. supervised accuracy. Circle area represents the number of tunable parameters, so better models sit toward the upper right and have a smaller circle. Our M$^2$-CLIP achieves the best zero-shot performance with very few tunable parameters.
  • Figure 2: Analysis of transferring a unimodal framework into a multimodal one. (a) Performance comparison. Note that ST-Adapter cannot perform zero-shot transfer and therefore has no zero-shot results on UCF101 and HMDB51. (b) Inter-class correlation maps of the top 40 correlated SSv2 label features of ST-Adapter+text vs. the corresponding 40 label features of our method. The redder the color, the stronger the feature coupling. Our M$^2$-CLIP improves performance on all four datasets and significantly reduces the correlation among the features of different labels.
  • Figure 3: (a) Overview of M$^2$-CLIP, illustrated with an adapter integrated into each transformer layer. M$^2$-CLIP consists of a video encoder, a text encoder, and a multi-task decoder. The backbones of the two encoders are frozen and assisted by the proposed trainable TED-Adapter and Text-Adapter. The multi-task decoder has four heads that use multi-task constraints to improve the joint representation of the entire multimodal framework. (b) Detailed structure of the proposed adapters, where $L=1+M$ and $h \times w=M$.
  • Figure 4: Ablation experiments for: (a) TED-Adapter, (b) Text-Adapter and (c) Multi-task decoder.
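
The relations $L=1+M$ and $h \times w=M$ from the Figure 3(b) caption can be checked with simple arithmetic. The sketch below assumes a ViT-style patch embedding in which the extra token is a single class token and the patches are square and non-overlapping. The helper name `token_layout` and the example resolution and patch size are illustrative assumptions, not values stated on this page.

```python
from typing import Tuple


def token_layout(height: int, width: int, patch: int) -> Tuple[int, int, int, int]:
    """Return (h, w, M, L) for a frame split into non-overlapping square patches."""
    if patch <= 0:
        raise ValueError("patch size must be positive")
    if height % patch or width % patch:
        raise ValueError("frame size must be divisible by the patch size")
    h, w = height // patch, width // patch
    m = h * w        # number of patch tokens: M = h * w
    length = 1 + m   # assumed single class token plus M patch tokens: L = 1 + M
    return h, w, m, length


if __name__ == "__main__":
    # Hypothetical example: a 224x224 frame with 16x16 patches.
    print(token_layout(224, 224, 16))
```

Under these assumptions, a 224×224 frame with 16×16 patches gives a 14×14 grid, so $M=196$ and $L=197$ tokens per frame.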