MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval

Xiaojie Jin; Bowen Zhang; Weibo Gong; Kai Xu; XueQing Deng; Peng Wang; Zhao Zhang; Xiaohui Shen; Jiashi Feng

TL;DR

MV-Adapter introduces a parameter-efficient approach to video-text retrieval by freezing most of a CLIP backbone and only training a small adapter. It adds a Temporal Adaptation module to capture global and local temporal context and a Cross Modality Tying mechanism to align video/text branches via a shared factor space, enabling effective multimodal learning with minimal parameter growth. Across five standard VTR benchmarks, MV-Adapter achieves results on par with or better than full fine-tuning while using roughly 2.4% additional parameters, and it significantly outperforms competing PETL methods. This work offers a practical path toward scalable, storage-efficient multimodal retrieval in real-world applications, with potential extensions to broader video understanding tasks.

Abstract

State-of-the-art video-text retrieval (VTR) methods typically involve fully fine-tuning a pre-trained model (e.g., CLIP) on specific datasets. However, this can result in significant storage costs in practical applications, as a separate model must be stored per task. To address this issue, we present our pioneering work that enables parameter-efficient VTR using a pre-trained model, with only a small number of tunable parameters during training. Towards this goal, we propose a new method dubbed Multimodal Video Adapter (MV-Adapter) for efficiently transferring the knowledge in the pre-trained CLIP from image-text to video-text. Specifically, MV-Adapter utilizes bottleneck structures in both video and text branches, along with two novel components. The first is a Temporal Adaptation Module that is incorporated in the video branch to introduce global and local temporal contexts. We also train weight calibrations to adjust to dynamic variations across frames. The second is Cross Modality Tying, which generates weights for the video and text branches by sharing cross-modality factors, for better alignment between modalities. Thanks to the above innovations, MV-Adapter can achieve comparable or better performance than standard full fine-tuning with negligible parameter overhead. Notably, MV-Adapter consistently outperforms various competing methods in V2T/T2V tasks by large margins on five widely used VTR benchmarks (MSR-VTT, MSVD, LSMDC, DiDeMo, and ActivityNet).
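The bottleneck structure mentioned in the abstract can be illustrated with a generic residual adapter: project features down to a small width, apply a nonlinearity, project back up, and add the result to the input while the backbone stays frozen. The sketch below is a minimal NumPy illustration under stated assumptions, not the paper's implementation. The function names, the ReLU activation, the zero-initialised up-projection, and the sizes 512 and 64 are all hypothetical, and the Temporal Adaptation and Cross Modality Tying components are not modelled.

```python
import numpy as np


def init_adapter(d_model, d_bottleneck, rng):
    """Create the down/up projection weights for one bottleneck adapter."""
    return {
        "w_down": rng.normal(0.0, 0.02, size=(d_model, d_bottleneck)),
        "b_down": np.zeros(d_bottleneck),
        # Assumption: a zero-initialised up-projection, so the adapter
        # starts out as an identity mapping on top of the frozen backbone.
        "w_up": np.zeros((d_bottleneck, d_model)),
        "b_up": np.zeros(d_model),
    }


def adapter_forward(x, params):
    """x has shape (tokens, d_model); returns x plus a low-rank residual."""
    hidden = np.maximum(x @ params["w_down"] + params["b_down"], 0.0)  # ReLU (assumed)
    return x + hidden @ params["w_up"] + params["b_up"]


def count_params(params):
    """Number of trainable scalars in the adapter."""
    return sum(v.size for v in params.values())


rng = np.random.default_rng(0)
d_model, d_bottleneck = 512, 64  # illustrative sizes, not taken from the paper
adapter = init_adapter(d_model, d_bottleneck, rng)
x = rng.normal(size=(12, d_model))  # e.g. 12 token features from a frozen layer
y = adapter_forward(x, adapter)
n_adapter_params = count_params(adapter)
```

Only the adapter dictionary would be stored per task in such a setup, which is the source of the storage saving: its size scales with `d_model * d_bottleneck` rather than with the full backbone.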

Paper Structure (23 sections, 9 equations, 4 figures, 10 tables, 2 algorithms)

Introduction
Related Work
Image-text Pre-trained Model
Parameter-Efficient Transfer Learning
Video Text Retrieval
Methodology
Preliminary
Overview of MV-Adapter
Temporal Adaptation
Cross Modality Tying
Efficiency Analysis
Experiment Setup
Datasets and Evaluation Metric
Implementation Detail
Results And Analysis
...and 8 more sections

Figures (4)

Figure 1: The overall pipeline of MV-Adapter, with an illustration of the basic structure of the video/text branches. Only a small part of the model is tunable during training, highlighted by the "unlock" symbol.
Figure 2: (a) The overall results on five widely used VTR benchmarks. We present the R@Sum (sum of R@1, R@5, and R@10) results for the Text-to-Video and Video-to-Text tasks for full fine-tuning, ours, and the best baseline method, displayed as a ratio to the R@Sum of full fine-tuning. (b) Comparison of Text-to-Video and Video-to-Text R@Sum for different methods on MSR-VTT, where the radius of each circle is positively correlated with the number of trainable parameters.
Figure 3: Illustration of temporal adaptation in the visual branch, including temporal modeling using a lightweight transformer block (TRM) and temporal calibration to generate dynamic upsample weights for each frame.
Figure 4: Visualizations of text-to-video (top row) and video-to-text (bottom row) results from MV-Adapter and ST-Adapter using the same query from MSR-VTT. In each example, the retrieval results of the baseline and MV-Adapter are shown in red and blue, respectively.
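The R@Sum metric used in Figure 2 is the sum of R@1, R@5, and R@10. A minimal sketch of how it could be computed from a query-by-candidate similarity matrix follows. It assumes a one-to-one pairing in which the correct candidate for query i is candidate i, and the function names `recall_at_k` and `r_sum` are hypothetical; benchmarks with several captions per video need a different ground-truth mapping.

```python
import numpy as np


def recall_at_k(sim, ks=(1, 5, 10)):
    """Recall@K in percent for a square similarity matrix.

    sim[i, j] is the similarity of query i to candidate j, and the
    correct candidate for query i is assumed to be candidate i.
    """
    correct = np.diag(sim)[:, None]
    # Zero-based rank of the correct candidate: how many candidates
    # score strictly higher than it for the same query.
    ranks = (sim > correct).sum(axis=1)
    return {k: 100.0 * float(np.mean(ranks < k)) for k in ks}


def r_sum(sim, ks=(1, 5, 10)):
    """R@Sum: the sum of the recall values over all K."""
    return sum(recall_at_k(sim, ks).values())


# Toy example: 4 text queries against 4 videos. Query 0 ranks a wrong
# video above its match; the other three queries are ranked perfectly.
sim = np.eye(4)
sim[0, 1] = 2.0

t2v = r_sum(sim)    # text-to-video: rows are text queries
v2t = r_sum(sim.T)  # video-to-text: transpose so rows are video queries
```

Dividing a method's R@Sum by the R@Sum of full fine-tuning gives the ratio that Figure 2(a) reports.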
