M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval

Xingning Dong, Zipeng Feng, Chunluan Zhou, Xuzheng Yu, Ming Yang, Qingpei Guo

TL;DR

M2-RAAP tackles the core bottlenecks of adaptation-based video-text pre-training by (1) filtering and rewriting the data to produce two high-quality corpora of 1M video-text pairs each, one English and one Chinese, (2) replacing raw videos with 8 key-frames to cut pre-training time, (3) incorporating temporal modeling (e.g., STAN) to capture video dynamics, and (4) enhancing video features through the Mug head and the new Auxiliary-Caption-Guided (ACG) strategy. Compared with the baselines, the recipe uses $90\%$ less data and $95\%$ less pre-training time while achieving state-of-the-art zero-shot video-text retrieval on four English and two Chinese benchmarks, demonstrated on three backbones (CLIP, AltCLIP, M$^{2}$-Encoder). Extensive ablations quantify the contributions of data curation, video input type, temporal modeling, and feature enhancement. The work offers a practical path toward efficient, bilingual, zero-shot cross-modal retrieval in real-world settings.

Abstract

We present a Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards effective and efficient zero-shot video-text retrieval, dubbed M2-RAAP. Built upon popular image-text models like CLIP, most current adaptation-based video-text pre-training methods face three major issues: a noisy data corpus, time-consuming pre-training, and limited performance gain. To address these issues, we conduct a comprehensive study covering four critical steps in video-text pre-training. Specifically, we investigate 1) data filtering and refinement, 2) video input type selection, 3) temporal modeling, and 4) video feature enhancement. We then summarize this empirical study into the M2-RAAP recipe, where our technical contributions lie in 1) the data filtering and text re-writing pipeline resulting in 1M high-quality bilingual video-text pairs, 2) the replacement of video inputs with key-frames to accelerate pre-training, and 3) the Auxiliary-Caption-Guided (ACG) strategy to enhance video features. We conduct extensive experiments by adapting three image-text foundation models on two refined video-text datasets from different languages, validating the robustness and reproducibility of M2-RAAP for adaptation-based pre-training. Results demonstrate that M2-RAAP yields superior performance with significantly reduced data (-90%) and time consumption (-95%), establishing a new SOTA on four English zero-shot retrieval datasets and two Chinese ones. We are preparing our refined bilingual data annotations and codebase, which will be available at https://github.com/alipay/Ant-Multi-Modal-Framework/tree/main/prj/M2_RAAP.
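To make the evaluation setting concrete, the following is a minimal sketch of text-to-video retrieval scored with Recall@K. It assumes each video is represented by mean-pooling the embeddings of its 8 key-frames, which is a simple baseline without temporal modeling and not the STAN, Mug, or ACG components of M2-RAAP. The random arrays stand in for real encoder outputs, and all function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def video_embeddings(frame_feats):
    """Mean-pool per-frame features (videos, frames, dim) into one unit vector per video."""
    return l2_normalize(l2_normalize(frame_feats).mean(axis=1))

def recall_at_k(text_emb, video_emb, k):
    """Text-to-video Recall@K, assuming text i is paired with video i."""
    sims = l2_normalize(text_emb) @ video_emb.T        # (texts, videos)
    top_k = np.argsort(-sims, axis=1)[:, :k]           # best k videos per text
    targets = np.arange(sims.shape[0])[:, None]
    return float((top_k == targets).any(axis=1).mean())

# Synthetic stand-ins for encoder outputs (assumption: 8 key-frames per video).
rng = np.random.default_rng(0)
num_pairs, num_frames, dim = 16, 8, 32
text_emb = rng.normal(size=(num_pairs, dim))
# Each frame feature is its paired text embedding plus small noise.
frame_feats = text_emb[:, None, :] + 0.1 * rng.normal(size=(num_pairs, num_frames, dim))

video_emb = video_embeddings(frame_feats)
r1 = recall_at_k(text_emb, video_emb, k=1)
print(f"R@1 = {r1:.2f}")
```

In a real zero-shot evaluation, `text_emb` and `frame_feats` would come from the pre-trained text and image encoders with no fine-tuning on the downstream dataset.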

Paper Structure (27 sections, 17 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: We propose M$^{2}$-RAAP, a multi-modal recipe for effective and efficient zero-shot video-text retrieval. Specifically, M$^{2}$-RAAP 1) filters and refines video-text pairs to improve the data quality, 2) adopts key-frames as video inputs to reduce pre-training time, and 3) introduces temporal modeling and video feature enhancement to improve pre-training performance. Compared with the baselines, M$^{2}$-RAAP uses only 10$\%$ of the data volume (10M $\rightarrow$ 1M) and only 5$\%$ of the pre-training time (1920h $\rightarrow$ 92h), reaching a new SOTA on four English downstream zero-shot video-text retrieval datasets and two Chinese ones.
  • Figure 2: Two examples that demonstrate the importance of temporal modeling and video feature enhancement.
  • Figure 3: The pipeline of M$^{2}$-RAAP. M$^{2}$-RAAP employs a progressive expansion scheme to evaluate the contributions of each component. We illustrate the architectures of the overall pre-training framework (top-left part), STAN module (middle-center part), Mug head (top-right part), and our proposed ACG strategy (top-center part).
  • Figure 4: An example of the automatic data filtering and text re-writing pipeline on the English WebVid-10M dataset.
  • Figure 5: An example of the automatic data filtering and text re-writing pipeline on the Chinese Youku-mPLUG-10M dataset.
  • ...and 1 more figure
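The reductions quoted in the abstract (-90% data, -95% time) can be checked against the raw numbers in the Figure 1 caption (10M $\rightarrow$ 1M pairs, 1920h $\rightarrow$ 92h). The short sketch below only redoes that arithmetic; the helper name `reduction` is illustrative and not taken from the paper.

```python
def reduction(before, after):
    """Fractional reduction when a quantity drops from `before` to `after`."""
    return 1.0 - after / before

data_cut = reduction(10_000_000, 1_000_000)  # video-text pairs: 10M -> 1M
time_cut = reduction(1920, 92)               # pre-training hours: 1920h -> 92h

print(f"data: {data_cut:.1%}, time: {time_cut:.1%}")
```

The time ratio 92/1920 is about 4.8%, which the caption rounds to 5% and the abstract reports as a 95% reduction.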