CoVR-2: Automatic Data Construction for Composed Video Retrieval
Lucas Ventura, Antoine Yang, Cordelia Schmid, Gül Varol
TL;DR
The paper tackles the scalability barrier in composed video retrieval (CoVR) by automatically generating training triplets from large video-caption collections. It pairs videos whose captions differ by one word, uses a fine-tuned LLM for modification-text generation (MTG-LLM) to describe the change between the two captions, and trains a CoVR-BLIP-2 model with an additional caption-retrieval loss to leverage both video and caption supervision. The authors release WebVid-CoVR (1.6M triplets), the manually annotated WebVid-CoVR-Test evaluation set, and CC-CoIR (3.3M CoIR triplets), and show that models trained on these datasets transfer to CoIR, improving the zero-shot state of the art on CIRR, FashionIQ, and CIRCO. The work also provides extensive ablations and qualitative analyses that show the effectiveness of real-data training and the benefit of combining visual and text cues for retrieval. Overall, the approach enables scalable, cross-domain composed retrieval with practical impact for multimodal search tasks.
Abstract
Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. We further validate that our methodology is equally applicable to image-caption pairs, by generating 3.3 million CoIR training triplets using the Conceptual Captions dataset. Our model builds on BLIP-2 pretraining, adapting it to composed video (or image) retrieval, and incorporates an additional caption retrieval loss to exploit extra supervision beyond the triplet. We provide extensive ablations to analyze the design choices on our new CoVR benchmark. Our experiments also demonstrate that training a CoVR model on our datasets effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on the CIRR, FashionIQ, and CIRCO benchmarks. Our code, datasets, and models are publicly available at https://imagine.enpc.fr/~ventural/covr/.
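
The caption-mining step (pairing videos whose captions differ by a single word, as summarized in the TL;DR) can be illustrated with a minimal Python sketch. This is not the authors' released implementation: the function name `mine_one_word_pairs`, the whitespace tokenization, and the lowercasing are assumptions made for illustration, and any filtering the paper applies on top of the pairing is not modeled here.

```python
from collections import defaultdict
from itertools import combinations


def mine_one_word_pairs(captions):
    """Find caption pairs that differ in exactly one word (hypothetical helper).

    captions: dict mapping a video id to its caption string.
    Returns a list of (id_a, id_b, word_a, word_b) tuples.
    """
    # Bucket captions by (position of the masked word, remaining words).
    # Two captions share a bucket only if they are identical everywhere
    # except possibly at the masked position.
    buckets = defaultdict(list)
    for video_id, caption in captions.items():
        words = caption.lower().split()  # assumed tokenization: whitespace
        for i, word in enumerate(words):
            key = (i, tuple(words[:i] + words[i + 1:]))
            buckets[key].append((video_id, word))

    pairs = []
    for entries in buckets.values():
        for (id_a, word_a), (id_b, word_b) in combinations(entries, 2):
            if word_a != word_b:  # skip exact duplicate captions
                pairs.append((id_a, id_b, word_a, word_b))
    return pairs


if __name__ == "__main__":
    toy_captions = {
        "v1": "a dog running on the beach",
        "v2": "a cat running on the beach",
        "v3": "a dog running on the grass",
        "v4": "sunset over mountains",
    }
    for pair in mine_one_word_pairs(toy_captions):
        print(pair)
```

Bucketing on the masked caption avoids comparing every caption against every other one, which matters at the scale of a collection such as WebVid2M. In the pipeline described in the abstract, each mined caption pair would then be passed to the language model to generate the modification text that completes the triplet.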
