Distilling Vision-Language Models on Millions of Videos

Yue Zhao; Long Zhao; Xingyi Zhou; Jialin Wu; Chun-Te Chu; Hui Miao; Florian Schroff; Hartwig Adam; Ting Liu; Boqing Gong; Philipp Krähenbühl; Liangzhe Yuan

TL;DR

The paper tackles the scarcity of large-scale video-text data by transferring an image-based vision-language model to the video domain through a two-stage adaptation: first aligning the visual encoder to dynamic video content, then adapting the language component with instruction-following data generated by an LLM. It further harnesses machine-generated pseudo-captions to train a CLIP-style dual-encoder, achieving state-of-the-art zero-shot performance on multiple video-language benchmarks and producing the largest-known video caption dataset. The approach demonstrates strong temporal and causal reasoning in video descriptions while significantly reducing the need for human-annotated video captions, highlighting scalable avenues for video-language pretraining and evaluation.

Abstract

The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video model, obtained by video-instruction-tuning (VIIT), is then used to auto-label millions of videos to generate high-quality captions. We show the adapted video-language model performs well on a wide range of video-language benchmarks. For instance, it surpasses the best prior result on open-ended NExT-QA by 2.8%. In addition, our model generates detailed descriptions for previously unseen videos, which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also leverages vision-language models. Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%. As a side product, we generate the largest video caption dataset to date.
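
The abstract mentions a dual-encoder model trained contrastively on the auto-generated captions but does not spell out the objective on this page. As a rough illustration only, the sketch below shows a symmetric contrastive (InfoNCE-style) loss of the kind commonly used for CLIP-style dual encoders. The function names, the plain-list embeddings, and the temperature value of 0.07 are assumptions made for this example, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Average of video-to-text and text-to-video contrastive losses.

    video_emb[i] and text_emb[i] are assumed to come from the same
    (video, pseudo-caption) pair; all other pairs in the batch act as
    negatives. The temperature is an illustrative value.
    """
    videos = [l2_normalize(v) for v in video_emb]
    texts = [l2_normalize(t) for t in text_emb]
    # Cosine similarity between every video and every caption, scaled.
    logits = [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in texts]
        for v in videos
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_transposed)
    )


# Toy batch: two videos whose captions match them exactly, then swapped.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is close to zero when each caption embedding matches its video and large when the captions are swapped, which is the behavior a contrastive objective is meant to reward.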

Paper Structure (25 sections, 2 equations, 6 figures, 15 tables)

Introduction
Related Work
Preliminaries and Notations
Method: Adapting VLMs to Videos
Model
Two-Stage Adaptation
Experiments
Datasets
Harnessing the Distilled Pseudo-Captions
Main Results
Ablation Studies
Conclusion
Instruction-Following Templates
Examples of Video Captioning
Dataset Details
...and 10 more sections

Figures (6)

Figure 1: Our video-language model takes a video along with any form of instruction as input and generates text according to the instruction. It generates textual descriptions with multiple granularities, including static appearance, general action, and detailed body movements. In contrast, raw alt-text can be erroneous; image captioners fail to capture the action; video captioners prefer outputting short text. Our generated data trains a significantly better video-language dual-encoder model. Best viewed in color.
Figure 2: Overview of adapting vision-language models to videos. In the first stage of visual adaptation on sequences of video frames, we fine-tune the vision encoder while freezing the language model using a video dataset with captions. In the second stage of language adaptation, we freeze the vision encoder while fine-tuning the language model using a video dataset with instruction-following data, e.g. a question that requires temporal reasoning to answer in this example.
Figure 3: An example of the instruction-following data. The first block shows the detailed captions used to prompt an LLM (PaLM 2 in our case), and the following two blocks show the LLM's responses. We show the keyframes in the top block for illustration purposes and do not use them while prompting the LLM. Different details in text are highlighted. Best viewed in color.
Figure 4: An example of video captions by PaLI-3 before and after video-specific adaptation. We show the keyframes on top for illustration purposes and the generated captions in the following blocks. Different details in text are highlighted. Best viewed in color.
Figure 5: Scaling effect of video captioning. For VLM-generated captions, the zero-shot video retrieval performance consistently improves with respect to an increasing amount of video data. Pre-training on retrieved alt-text quickly stagnates.
...and 1 more figure
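
The two-stage adaptation described in the Figure 2 caption amounts to a schedule of which component is trained and which is frozen at each stage. The sketch below records that schedule as plain data. The `Stage` class, its field names, and the `trainable_components` helper are hypothetical names chosen for illustration and do not correspond to the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One adaptation stage: what is trained, what is frozen, and on what data."""
    name: str
    train_vision_encoder: bool
    train_language_model: bool
    data: str


# Schedule as described in the Figure 2 caption.
STAGES = [
    Stage(
        name="visual adaptation",
        train_vision_encoder=True,    # vision encoder is fine-tuned
        train_language_model=False,   # language model stays frozen
        data="video dataset with captions",
    ),
    Stage(
        name="language adaptation",
        train_vision_encoder=False,   # vision encoder stays frozen
        train_language_model=True,    # language model is fine-tuned
        data="video dataset with instruction-following data",
    ),
]


def trainable_components(stage):
    """Return the names of the components updated in the given stage."""
    components = []
    if stage.train_vision_encoder:
        components.append("vision_encoder")
    if stage.train_language_model:
        components.append("language_model")
    return components


for stage in STAGES:
    print(f"{stage.name}: train {trainable_components(stage)} on {stage.data}")
```

In a real training loop, the same flags would typically decide which parameter groups are passed to the optimizer or have gradients enabled; that wiring depends on the framework and is left out here.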
