Learning text-to-video retrieval from image captioning

Lucas Ventura, Cordelia Schmid, Gül Varol

TL;DR

The paper tackles text-to-video retrieval in a setting with unlabeled videos by using off-the-shelf image captioners to generate frame-level captions, which are filtered by CLIPScore and used as pseudo-labels. It introduces a multi-caption query-scoring framework that temporally pools frame embeddings and optimizes a symmetric contrastive objective to align video and caption representations, enabling training without manual video annotations. Empirical results on ActivityNet, MSR-VTT, and MSVD show consistent improvements over zero-shot CLIP baselines, with BLIP-based initialization and combining multiple captioners and datasets yielding robust gains. The approach offers a practical pathway for domain adaptation of video-language models using abundant image-caption data, with potential extensions to more captioners and larger unlabeled video collections.
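To make the "symmetric contrastive objective" concrete, here is a minimal sketch of a two-directional InfoNCE-style loss over a batch of video-caption similarities. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature value of 0.05, and the equal weighting of the two directions are all hypothetical choices made for this example.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(sim, temperature=0.05):
    """Average of video-to-text and text-to-video cross-entropy.

    sim[i][j] is the similarity between video i and caption j; matching
    pairs are assumed to lie on the diagonal. The temperature and the
    0.5/0.5 weighting are assumptions made for this sketch.
    """
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    # Video-to-text: each row should put its mass on the diagonal entry.
    v2t = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    # Text-to-video: same computation over the columns.
    cols = [[scaled[i][j] for i in range(n)] for j in range(n)]
    t2v = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)


aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]])
misaligned = symmetric_contrastive_loss([[0.0, 1.0], [1.0, 0.0]])
```

With these inputs, `aligned` is close to zero while `misaligned` is large, which is the behavior that pulls matching video and caption embeddings together during training.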

Abstract

We describe a protocol to study text-to-video retrieval training with unlabeled videos, where we assume (i) no access to labels for any videos, i.e., no access to the set of ground-truth captions, but (ii) access to labeled images in the form of text. Using image expert models is a realistic scenario given that annotating images is cheaper, and therefore more scalable, than expensive video labeling schemes. Recently, zero-shot image experts such as CLIP have established a new strong baseline for video understanding tasks. In this paper, we make use of this progress and instantiate the image experts from two types of models: a text-to-image retrieval model to provide an initial backbone, and image captioning models to provide a supervision signal for unlabeled videos. We show that automatically labeling video frames with image captioning enables text-to-video retrieval training. This process adapts the features to the target domain at no manual annotation cost, consequently outperforming the strong zero-shot CLIP baseline. During training, we sample captions from multiple video frames that best match the visual content, and perform a temporal pooling over frame representations by scoring frames according to their relevance to each caption. We conduct extensive ablations to provide insights and demonstrate the effectiveness of this simple framework by outperforming the CLIP zero-shot baselines on text-to-video retrieval on three standard datasets, namely ActivityNet, MSR-VTT, and MSVD.
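The temporal pooling described at the end of the abstract, which scores frames by their relevance to a caption, can be sketched as a softmax-weighted average of frame embeddings. This is a hedged illustration: the softmax form, the temperature of 0.1, and the names `query_scoring_pool`, `caption_emb`, and `frame_embs` are assumptions for this example and are not taken from the paper's code. Embeddings are assumed to be non-zero vectors.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))  # assumes a non-zero vector
    return [x / norm for x in v]


def query_scoring_pool(caption_emb, frame_embs, temperature=0.1):
    """Pool frame embeddings, weighting each frame by caption relevance.

    Returns (pooled_video_embedding, frame_weights). The weights are a
    softmax over caption-frame cosine similarities; the temperature is
    a hypothetical value chosen for this sketch.
    """
    c = l2_normalize(caption_emb)
    frames = [l2_normalize(f) for f in frame_embs]
    scores = [dot(c, f) / temperature for f in frames]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * f[d] for w, f in zip(weights, frames))
              for d in range(len(c))]
    return pooled, weights


pooled, weights = query_scoring_pool(
    [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
)
```

In this toy call the first frame points in the same direction as the caption, so it receives the largest weight, and the pooled video embedding is dominated by it. A lower temperature sharpens the weighting toward the single best frame, while a higher one approaches plain mean pooling.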

Paper Structure (23 sections, 5 equations, 8 figures, 17 tables)

Figures (8)

  • Figure 1: Framework: Instead of using the ground-truth video caption, we extract image captions to automatically label unlabeled video frames, which we filter to obtain high-quality captions. The selected captions from multiple image captioners are incorporated into a text-to-video retrieval training where each video is paired with multiple caption labels.
  • Figure 2: Caption selection and multi-caption query-scoring (MCQS): (a) To select the best captions for a given video, we first extract image captions from both the ClipCap [ClipCap] and BLIP [li2022blip] models for $M$ frames. We then compute the CLIPScore [max2022Hitchhiker] (gray box), and finally select the Top $K=2$ captions for each captioner: $c_1$ and $c_2$ for ClipCap (highlighted in green), and $c_3$ and $c_4$ for BLIP (highlighted in blue). (b) MCQS takes a caption embedding $\bar{c}_l$ and weights the frame embeddings $\bar{v}_1, \ldots, \bar{v}_N$ according to the query-scoring temporal pooling function $f_p$ to obtain a video representation $\widetilde{v}_l$. Finally, we simply average the four similarities obtained with their respective query-scoring.
  • Figure 3: Qualitative results: We provide video retrieval results for our best model trained with the combination of the three datasets. The examples belong to the test sets of ActivityNet (first two rows), MSR-VTT (third and fourth rows), and MSVD (last two rows). For each example, we show the text query, the ground-truth video (first column, blue border), and the top 5 retrieved videos from the gallery. Each video is displayed using only its middle frame, with a green border if it matches the ground-truth video, or a red border otherwise. Overall, even in cases where the correct video is not retrieved at the first rank, all the retrieved videos are semantically similar to the text query.
  • Figure A.1: CLIPScore kernel density estimate: We plot the CLIPScore distribution for three datasets, and both models (ClipCap and BLIP). CLIPScore is higher for ClipCap than for BLIP, potentially because of the CLIP backbone.
  • Figure A.2: Combining captioners: We compare four strategies: selecting 2 from 10 ClipCap captions, selecting 2 from 10 BLIP captions, selecting the Top 4 from the 20 combined captions, and selecting the Top 2 from each captioner. We highlight the best performance with a black border.
  • ...and 3 more figures
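
The caption selection step in Figure 2(a) can be sketched as follows, assuming image and caption embeddings are already available. The `clip_score` helper uses the rescaled, clipped cosine similarity from the original CLIPScore definition (weight 2.5); whether the paper applies that rescaling is not stated in this summary, and it does not change the ranking. The function names and the toy scores are hypothetical.

```python
import math


def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore-style score: w * max(cosine similarity, 0)."""
    num = sum(a * b for a, b in zip(image_emb, text_emb))
    den = (math.sqrt(sum(a * a for a in image_emb))
           * math.sqrt(sum(b * b for b in text_emb)))
    return w * max(num / den, 0.0)


def select_top_k(scored_by_captioner, k=2):
    """Keep the k highest-scoring captions from each captioner.

    scored_by_captioner maps a captioner name to a list of
    (caption, score) pairs, one pair per sampled frame.
    """
    selected = []
    for name, scored in scored_by_captioner.items():
        best = sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
        selected.extend(caption for caption, _ in best)
    return selected


# Toy scores standing in for per-frame CLIPScore values (made up).
candidates = {
    "ClipCap": [("a", 0.7), ("b", 0.9), ("c", 0.8)],
    "BLIP": [("d", 0.6), ("e", 0.5), ("f", 0.65)],
}
labels = select_top_k(candidates, k=2)
```

Selecting per captioner, rather than taking a global top 4, guarantees that each captioner contributes labels even when one model's scores are systematically higher, as Figure A.1 reports for ClipCap.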