CoVR-2: Automatic Data Construction for Composed Video Retrieval

Lucas Ventura, Antoine Yang, Cordelia Schmid, Gül Varol

TL;DR

The paper tackles the scalability barrier in composed video retrieval (CoVR) by automatically generating training triplets from large video-caption collections. It pairs captions that differ by one word, uses a fine-tuned modification text generation language model (MTG-LLM) to describe the modification, and trains a CoVR-BLIP-2 model with an additional caption-retrieval loss to leverage both video and caption supervision. The authors release WebVid-CoVR (1.6M triplets), WebVid-CoVR-Test, and CC-CoIR (3.3M CoIR triplets), and show that models trained on this data transfer zero-shot to the CoIR benchmarks CIRR, FashionIQ, and CIRCO, improving on the state of the art in that setup. The work also provides extensive ablations and qualitative analyses, showing the effectiveness of real-data training and the benefits of combining visual and text cues for retrieval. Overall, the approach enables scalable, cross-domain composed retrieval with practical impact for multimodal search tasks.

Abstract

Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. We further validate that our methodology is equally applicable to image-caption pairs, by generating 3.3 million CoIR training triplets using the Conceptual Captions dataset. Our model builds on BLIP-2 pretraining, adapting it to composed video (or image) retrieval, and incorporates an additional caption retrieval loss to exploit extra supervision beyond the triplet. We provide extensive ablations to analyze the design choices on our new CoVR benchmark. Our experiments also demonstrate that training a CoVR model on our datasets effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on the CIRR, FashionIQ, and CIRCO benchmarks. Our code, datasets, and models are publicly available at https://imagine.enpc.fr/~ventural/covr/.
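
The caption-mining step can be illustrated with a minimal sketch. It assumes lowercased, whitespace-tokenized captions and requires the differing word to sit at the same position in both captions; the function name `mine_one_word_pairs` and the toy captions are hypothetical and not taken from the released code, and any further filtering the authors apply is not reproduced here.

```python
from collections import defaultdict
from itertools import combinations


def mine_one_word_pairs(captions):
    """Return (i, j, word_i, word_j) for captions differing in exactly one word.

    Assumption: captions are lowercased and split on whitespace, and the
    differing word must occupy the same position in both captions.
    """
    buckets = defaultdict(list)
    for idx, caption in enumerate(captions):
        tokens = caption.lower().split()
        for pos in range(len(tokens)):
            # Mask one position: captions that share this key can only
            # differ in the masked word.
            key = (pos, tuple(tokens[:pos]), tuple(tokens[pos + 1:]))
            buckets[key].append((idx, tokens[pos]))

    pairs = []
    for entries in buckets.values():
        for (i, word_i), (j, word_j) in combinations(entries, 2):
            if word_i != word_j:  # skip exact duplicate captions
                pairs.append((i, j, word_i, word_j))
    return sorted(pairs)


# Toy captions (hypothetical); index 3 duplicates index 1 on purpose.
captions = [
    "clouds moving over a field",
    "clouds moving over a lake",
    "a dog running on the beach",
    "clouds moving over a lake",
]
pairs = mine_one_word_pairs(captions)
for i, j, word_i, word_j in pairs:
    print(f"{captions[i]!r} -> {captions[j]!r} ({word_i} -> {word_j})")
```

Bucketing by masked caption avoids comparing every caption against every other one, which matters at the scale of millions of captions. Each mined pair, together with its two videos, would then be passed to the language model that writes the modification text.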

Paper Structure (37 sections, 2 equations, 24 figures, 21 tables)

This paper contains 37 sections, 2 equations, 24 figures, 21 tables.

Figures (24)

  • Figure 1: Task: Composed Video Retrieval (CoVR) seeks to retrieve videos from a database by searching with both a query image and a query text. The text typically specifies the desired modification to the query image. In this example, a traveller might wonder what the photographed place looks like during a fountain show, and describes several modifications, such as "during show at night, with fireworks".
  • Figure 2: Method overview: We automatically mine similar caption pairs from a large video-caption database from the Web, and use our modification text generation language model (MTG-LLM) to describe the difference between the two captions. MTG-LLM is trained on a dataset of 715 triplet text annotations [brooks2022instructpix2pix]. The resulting triplet with the two corresponding videos (query $q$ and target video $v$) and the modification text ($t$) is therefore obtained fully automatically, allowing a scalable CoVR training data generation.
  • Figure 3: Examples of generated CoVR triplets in WebVid-CoVR: The middle frame of each video is shown with its corresponding caption, with the distinct word highlighted in bold. Additionally, the generated modification text is displayed on top of each pair of videos. The bottom example illustrates a noisy generated modification text, as 'beautiful' is subjective and both target and query videos can be considered as beautiful fields.
  • Figure 4: Model architecture of CoVR-BLIP-2: The BLIP-2 [li2023blip2] image encoder extracts visual features from the image query $q$. These visual features are combined with the text query $t$ (modification text) through the BLIP-2 image-grounded text encoder to obtain a multi-modal query embedding $f(q, t)$. To encode videos, $N$ frames are individually encoded with the BLIP-2 image encoder and its Q-Former, and aggregated via a weighted mean into a single video embedding $h(v)$. The goal of CoVR (video retrieval) is to maximize similarity between the multi-modal query $f(q, t)$ and the target video $h(v)$. During training, an additional caption retrieval loss $\mathcal{L}_c$ is defined between $f(q, t)$ and the target caption embedding $g(c)$. Note that for simplicity, we visualize one Q-Former block, but in practice there are 12 blocks as in [li2023blip2]. To reduce the 32 tokens output by the Q-Former, we simply average them before computing cosine similarities. While each Q-Former is initialized from the BLIP-2 pretraining, they are finetuned on our CoVR/CoIR data. When training for CoIR, the target becomes a single image, removing the need for the weighted mean. See the training subsection of the paper for more details.
  • Figure 5: Examples of generated triplets: We illustrate triplet samples generated using our automatic dataset creation methodology (left: WebVid-CoVR, right: CC-CoIR). Each sample consists of two videos/images with their corresponding captions (at the bottom of each video/image) and the generated modification text using our MTG-LLM (in purple).
  • ...and 19 more figures
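
The retrieval scoring described in the Figure 4 caption (token averaging, weighted mean over frames, cosine similarity) can be sketched in plain Python. This is an illustration only: the 2-dimensional embeddings and the frame weights are made-up values, the caption does not say how the weights are computed so they are supplied by hand, and all function names are hypothetical rather than taken from the released code.

```python
import math


def mean_tokens(tokens):
    """Average a list of token vectors (e.g. Q-Former outputs) into one vector."""
    return [sum(column) / len(tokens) for column in zip(*tokens)]


def weighted_mean(frames, weights):
    """Aggregate per-frame embeddings into a single video embedding h(v)."""
    total = sum(weights)
    return [
        sum(w * x for w, x in zip(weights, column)) / total
        for column in zip(*frames)
    ]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_videos(query_embedding, videos):
    """Rank videos by similarity to the query; videos are (frames, weights) pairs."""
    scores = [cosine(query_embedding, weighted_mean(f, w)) for f, w in videos]
    order = sorted(range(len(videos)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy query tokens standing in for the multi-modal query f(q, t) (assumed values).
query = mean_tokens([[1.0, 0.0], [0.8, 0.0]])

# Two toy videos, each with two frame embeddings and hand-picked frame weights.
videos = [
    ([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0]),
    ([[0.0, 1.0], [0.0, 1.0]], [1.0, 1.0]),
]
order, scores = rank_videos(query, videos)
print(order, [round(s, 4) for s in scores])
```

In the single-image (CoIR) case the target has one frame, so `weighted_mean` reduces to the identity, which matches the caption's remark that the weighted mean is then unnecessary.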