Pretrained Image-Text Models are Secretly Video Captioners

Chunhui Zhang, Yiren Jian, Zhongyu Ouyang, Soroush Vosoughi

TL;DR

The paper addresses the high resource demands of video captioning by showing that a pre-trained image-language model (BLIP-2) can be repurposed for video with minimal architectural changes, using frame concatenation and post-training. It identifies three key levers (model scale, data efficiency through extensive image-text pretraining, and reinforcement-learning-based supervision with SCST) and demonstrates that mid-sized LLMs, with a frozen ViT, achieve competitive results using only about 6,000 video-text pairs. The method ranks 2nd on MSR-VTT and MSVD and 3rd on VATEX, outperforming many specialized video models while requiring far less data and compute. This work offers a practical, open-source recipe for resource-constrained video captioning and points to broader opportunities for pseudo-labeling video-text data in low-resource settings.

Abstract

Developing video captioning models is computationally expensive. The dynamic nature of video also complicates the design of multimodal models that can effectively caption these sequences. However, we find that by using minimal computational resources and without complex modifications to address video dynamics, an image-based model can be repurposed to outperform several specialised video captioning systems. Our adapted model demonstrates top-tier performance on major benchmarks, ranking 2nd on MSR-VTT and MSVD, and 3rd on VATEX. We transform a typical image captioning model, BLIP-2, into a competitive video captioner by post-training it with only 6,000 video-text pairs and simply concatenating frames. This is significantly less data than other methods, which use 2.5 to 144 million pairs. From a resource optimization perspective, this video captioning study focuses on three fundamental factors: optimizing model scale, maximizing data efficiency, and incorporating reinforcement learning. This extensive study demonstrates that a lightweight, image-based adaptation strategy can rival state-of-the-art video captioning systems, offering a practical solution for low-resource scenarios.
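
The abstract's "simply concatenating frames" can be illustrated with a small sketch. The source does not state at which stage the frames are joined, so this example assumes each frame has already been encoded into a short sequence of visual tokens; the names `fuse_by_concatenation` and `fuse_by_average` and the toy values are hypothetical, not taken from the paper's code. Averaging is included only as a contrast, since it collapses the frames into one token sequence and discards their order.

```python
from typing import List

Token = List[float]          # one visual token embedding
FrameTokens = List[Token]    # all tokens produced for a single frame


def fuse_by_concatenation(frames: List[FrameTokens]) -> FrameTokens:
    """Join per-frame token sequences end to end, preserving frame order."""
    fused: FrameTokens = []
    for frame in frames:
        fused.extend(frame)
    return fused


def fuse_by_average(frames: List[FrameTokens]) -> FrameTokens:
    """Average tokens position-wise across frames; temporal order is lost."""
    num_frames = len(frames)
    num_tokens = len(frames[0])
    dim = len(frames[0][0])
    return [
        [sum(frame[t][d] for frame in frames) / num_frames for d in range(dim)]
        for t in range(num_tokens)
    ]


# Toy input (assumed shapes): 3 frames, 2 tokens per frame, embedding size 2.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[3.0, 0.0], [0.0, 3.0]],
    [[5.0, 0.0], [0.0, 5.0]],
]
concatenated = fuse_by_concatenation(frames)  # 6 tokens, frame order kept
averaged = fuse_by_average(frames)            # 2 tokens, frames blended
print(len(concatenated), len(averaged))
```

Concatenation makes the sequence passed to the language model grow linearly with the number of frames, whereas averaging keeps it fixed at the per-frame length.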

Paper Structure

This paper contains 37 sections, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Key factors in recycling BLIP for video captioning: Model – assessing the scale and trainability of components like the ViT, LLM, and Q-Former; Data – examining the volume, quality, and fusion strategies for image and video-text pairs; Supervision – employing reinforcement learning to align generated captions with human language quality standards (CIDEr).
  • Figure 2: Comparisons of different setups for models on the MSR-VTT dataset: (a) freezing modules, (b) scales of LLMs, (c) usage of image-text pairs in pretrained BLIP-2, and (d) supervision with and without SCST. We also replicate the comparisons and ablations on other datasets (e.g., MSVD and VATEX) in the appendix.
  • Figure 3: (a) Temporal fusion by averaging vs. concatenation; (b) different resolutions.
  • Figure 4: Training curves of the video captioning model on MSR-VTT, with different module freezing configurations. The vision backbone is ViT, and the language backbone is FLAN-T5. The curves represent three settings: (a) ViT frozen, (b) only Q-Former trainable, and (c) all components trainable.
  • Figure 5: Training curves of a video captioning model with different sizes of LLMs. (a), (b), and (c) show training curves of LLMs with sizes 2.7B, 3B, and 7B respectively.
  • ...and 6 more figures
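
The SCST supervision compared in Figure 2(d) can be sketched as follows. This is a minimal illustration of the general self-critical sequence training idea, not the paper's implementation: a sampled caption is rewarded relative to the model's own greedy caption, and the resulting advantage scales the log-probability of the sampled tokens. The `toy_reward` function is a hypothetical word-overlap stand-in for CIDEr, and all captions and log-probabilities are made-up values.

```python
from typing import List


def toy_reward(caption: str, references: List[str]) -> float:
    """Hypothetical stand-in for CIDEr: best fraction of caption words found in a reference."""
    words = caption.split()
    if not words:
        return 0.0
    return max(
        sum(w in ref.split() for w in words) / len(words) for ref in references
    )


def scst_loss(sample_logprobs: List[float], sample_reward: float, greedy_reward: float) -> float:
    """Self-critical loss: -(r(sample) - r(greedy)) * sum of sampled-token log-probs.

    Rewards are treated as constants; only the log-probs would carry gradients.
    """
    advantage = sample_reward - greedy_reward
    return -advantage * sum(sample_logprobs)


# Toy example with assumed values.
references = ["a man is playing a guitar"]
greedy_caption = "a man is singing"           # baseline from greedy decoding
sampled_caption = "a man is playing guitar"   # caption drawn by sampling
sampled_logprobs = [-0.5, -0.5, -1.0, -1.0, -1.0]  # one log-prob per sampled word

r_greedy = toy_reward(greedy_caption, references)
r_sample = toy_reward(sampled_caption, references)
loss = scst_loss(sampled_logprobs, r_sample, r_greedy)
print(r_greedy, r_sample, loss)
```

When the sampled caption scores higher than the greedy baseline, minimizing this loss raises the probability of the sampled words; when it scores lower, their probability is pushed down.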