M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
Xingning Dong, Zipeng Feng, Chunluan Zhou, Xuzheng Yu, Ming Yang, Qingpei Guo
TL;DR
M2-RAAP tackles the core bottlenecks of adaptation-based video-text pre-training by (1) filtering and rewriting the data to produce high-quality bilingual data (1M English and 1M Chinese video-text pairs), (2) replacing raw videos with 8 key-frames to cut pre-training time, (3) incorporating temporal modeling (e.g., STAN) to capture video dynamics, and (4) enhancing video features through Mug and the new Auxiliary-Caption-Guided (ACG) module. The result is a substantial reduction in data and pre-training time (90% and 95%, respectively) alongside state-of-the-art zero-shot video-text retrieval on multiple English and Chinese benchmarks, demonstrated on three backbones (CLIP, AltCLIP, M2-Encoder). The recipe is reproducible and is backed by extensive ablations that quantify the contributions of data curation, input type, temporal modeling, and feature enhancement. The work offers a practical path toward efficient, multilingual, zero-shot cross-modal retrieval in real-world settings.
Abstract
We present a Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards effective and efficient zero-shot video-text retrieval, dubbed M2-RAAP. Built upon popular image-text models like CLIP, most current adaptation-based video-text pre-training methods face three major issues: a noisy data corpus, time-consuming pre-training, and limited performance gain. To this end, we conduct a comprehensive study covering four critical steps in video-text pre-training. Specifically, we investigate 1) data filtering and refinement, 2) video input type selection, 3) temporal modeling, and 4) video feature enhancement. We then summarize this empirical study into the M2-RAAP recipe, where our technical contributions lie in 1) the data filtering and text re-writing pipeline resulting in 1M high-quality bilingual video-text pairs, 2) the replacement of video inputs with key-frames to accelerate pre-training, and 3) the Auxiliary-Caption-Guided (ACG) strategy to enhance video features. We conduct extensive experiments by adapting three image-text foundation models on two refined video-text datasets in different languages, validating the robustness and reproducibility of M2-RAAP for adaptation-based pre-training. Results demonstrate that M2-RAAP yields superior performance with significantly reduced data (-90%) and time consumption (-95%), establishing a new SOTA on four English zero-shot retrieval datasets and two Chinese ones. We are preparing our refined bilingual data annotations and codebase, which will be available at https://github.com/alipay/Ant-Multi-Modal-Framework/tree/main/prj/M2_RAAP.
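As a rough illustration of the zero-shot video-text retrieval setting evaluated above, the following Python sketch scores text queries against videos that are each represented by a handful of key-frame embeddings, then reports Recall@K. It is not the M2-RAAP implementation. Plain mean pooling over frames stands in for the temporal modeling and ACG steps, the embeddings are hand-written toy vectors rather than outputs of CLIP, AltCLIP, or M2-Encoder, and all function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_pool(frame_embeddings):
    """Average per-frame embeddings into one video embedding.

    Assumption: a simple stand-in for temporal modeling, not the paper's method.
    """
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(frame[d] for frame in frame_embeddings) / n for d in range(dim)]


def similarity_matrix(text_embs, video_embs):
    """Cosine similarity between every text (rows) and every video (columns)."""
    texts = [l2_normalize(e) for e in text_embs]
    videos = [l2_normalize(e) for e in video_embs]
    return [[sum(a * b for a, b in zip(t, v)) for v in videos] for t in texts]


def recall_at_k(sim, k):
    """Fraction of text queries whose paired video (same index) ranks in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Toy data: three videos with two key-frame embeddings each (dimension 3).
video_frames = [
    [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.2, 0.8]],
]
# Text i is the caption paired with video i; the second caption is deliberately ambiguous.
text_embs = [[1.0, 0.1, 0.0], [1.0, 0.9, 0.0], [0.0, 0.0, 1.0]]

video_embs = [mean_pool(frames) for frames in video_frames]
sim = similarity_matrix(text_embs, video_embs)
print("R@1:", recall_at_k(sim, 1))
print("R@2:", recall_at_k(sim, 2))
```

In this toy setup the ambiguous second caption is matched to the wrong video at rank 1 but recovered at rank 2, which is the kind of gap between R@1 and higher-K recall that retrieval benchmarks are designed to expose.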
