Distilling Vision-Language Models on Millions of Videos
Yue Zhao, Long Zhao, Xingyi Zhou, Jialin Wu, Chun-Te Chu, Hui Miao, Florian Schroff, Hartwig Adam, Ting Liu, Boqing Gong, Philipp Krähenbühl, Liangzhe Yuan
TL;DR
The paper tackles the scarcity of large-scale video-text data by transferring an image-based vision-language model to the video domain through a two-stage adaptation: first aligning the visual encoder to dynamic video content, then adapting the language component with instruction-following data generated by an LLM. It further harnesses machine-generated pseudo-captions to train a CLIP-style dual-encoder, achieving state-of-the-art zero-shot performance on multiple video-language benchmarks and producing the largest-known video caption dataset. The approach demonstrates strong temporal and causal reasoning in video descriptions while significantly reducing the need for human-annotated video captions, highlighting scalable avenues for video-language pretraining and evaluation.
Abstract
Recent advances in vision-language models are largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video model, obtained by video-instruction-tuning (VIIT), is then used to auto-label millions of videos with high-quality captions. We show that the adapted video-language model performs well on a wide range of video-language benchmarks. For instance, it surpasses the best prior result on open-ended NExT-QA by 2.8%. In addition, our model generates detailed descriptions for previously unseen videos, which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also leverages vision-language models. Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%. As a by-product, we generate the largest video caption dataset to date.
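To make the contrastive training of the dual-encoder more concrete, here is a minimal sketch of a symmetric CLIP-style (InfoNCE) loss over a batch of paired video and caption embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names, the pure-Python list representation, and the temperature value of 0.07 are all hypothetical choices made for readability, and the abstract does not specify the exact loss or its hyperparameters.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Average of video-to-text and text-to-video cross-entropy losses.

    video_embs[i] and text_embs[i] are assumed to be a matched pair
    (for example, a clip and its auto-generated caption); every other
    item in the batch serves as a negative.
    The temperature default is an assumption, not a value from the paper.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)

    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Video-to-text: each row should put its mass on the diagonal entry.
    v2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-video: same computation on the transposed similarity matrix.
    logits_t = [list(col) for col in zip(*logits)]
    t2v = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n

    return 0.5 * (v2t + t2v)
```

In this sketch, a batch whose captions line up with their videos yields a loss near zero, while shuffling the captions raises it, which is the signal that pulls matched video and text embeddings together during training.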
