Composed Video Retrieval via Enriched Context and Discriminative Embeddings

Omkar Thawakar, Muzammal Naseer, Rao Muhammad Anwer, Salman Khan, Michael Felsberg, Mubarak Shah, Fahad Shahbaz Khan

TL;DR

This work tackles composed video retrieval by incorporating query-specific context through detailed language descriptions and learning discriminative embeddings across vision, text, and vision-text modalities. The method takes three inputs (reference video, enriched description, and change text), processes them with a shared multi-modal encoder with cross-attention, and trains with hard-negative contrastive losses across multiple target databases. The resulting joint embedding $\tilde f(q,d,t)$ is used to select the target video via $v^* = \underset{v\in V}{\arg\max} \; \mathcal{L}( \tilde f(q,d,t), g(v))$, where $\tilde f(q,d,t) = f(q,t) + f(q,d) + f(e(d),t)$. The method achieves state-of-the-art performance on WebVid-CoVR and strong zero-shot results on the CoIR benchmarks CIRR and FashionIQ. The approach also shows robust transfer learning and benefits from high-quality language descriptions generated by multimodal conversation models, with code and models available for reproducibility.
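The selection rule in the TL;DR can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes the score $\mathcal{L}$ is cosine similarity, and the function names (`l2_normalize`, `joint_embedding`, `retrieve`) and all embedding values are hypothetical stand-ins for the outputs of the real encoders $f$, $e$, and $g$.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def joint_embedding(f_qt, f_qd, f_dt):
    """Sum the three pairwise multi-modal embeddings, then normalize."""
    return l2_normalize(f_qt + f_qd + f_dt)

def retrieve(query_emb, target_embs, k=1):
    """Return indices of the k targets with the highest cosine similarity."""
    sims = l2_normalize(target_embs) @ query_emb
    return np.argsort(-sims)[:k]

# Toy database of four target embeddings g(v) (hypothetical values).
targets = np.eye(4)

# Hypothetical encoder outputs f(q,t), f(q,d), f(e(d),t) for one query.
f_qt = np.array([0.1, 0.0, 0.9, 0.0])
f_qd = np.array([0.0, 0.2, 0.8, 0.1])
f_dt = np.array([0.3, 0.0, 0.7, 0.0])

query = joint_embedding(f_qt, f_qd, f_dt)
top2 = retrieve(query, targets, k=2)
print(top2)
```

Because the three components are simply added, a target only ranks first when the components agree on it on balance; here all three point mostly at the third target, so it is returned ahead of the others.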

Abstract

Composed video retrieval (CoVR) is a challenging problem in computer vision which has recently highlighted the integration of modification text with visual queries for more sophisticated video search in large databases. Existing works predominantly rely on visual queries combined with modification text to distinguish relevant videos. However, such a strategy struggles to fully preserve the rich query-specific context in retrieved target videos and only represents the target video using visual embedding. We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information and learns discriminative embeddings of vision only, text only and vision-text for better alignment to accurately retrieve matched target videos. Our proposed framework can be flexibly employed for both composed video (CoVR) and image (CoIR) retrieval tasks. Experiments on three datasets show that our approach obtains state-of-the-art performance for both CoVR and zero-shot CoIR tasks, achieving gains as high as around 7% in terms of recall@K=1 score. Our code, models, and detailed language descriptions for the WebVid-CoVR dataset are available at https://github.com/OmkarThawakar/composed-video-retrieval

Paper Structure (12 sections, 4 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Comparison between the baseline CoVR-BLIP [ventura2023covr] (top row) and our approach (bottom row) on example video samples from the WebVid-CoVR test set. Here, the change text is highlighted in red. We observe that the baseline typically focuses only on the change while ignoring the semantic alignment of the target video with the query input video. For example, the composed target video in the second example from the left should have the change "yellow" reflected on the salient white tulip surrounded by red flowers, as in the query input. However, the retrieved target video loses the context (red flowers surrounding the yellow tulip). This suggests that it is particularly challenging for the model to understand the correspondence between the change text and the relevant target video using only the visual input. In contrast, our retrieved target videos are visually similar to the input query composed with the change text. Our approach, which leverages detailed descriptions (highlighted in white boxes) for joint multi-modal embedding alignment, encodes the necessary context to alter the composition of the video (e.g., changing the color of the "white flower" in the 2nd video to yellow and changing the color of the "sky and clouds" to orange in the 3rd video).
  • Figure 2: Our framework comprises three inputs: the reference video, a detailed visual description of the input video, and a change text corresponding to the target video. The input video is encoded by the vision encoder $g$, and the description is encoded by the frozen text encoder $e$. The default tokenizer tokenizes the change text. The encoded input triplet ($q$, $d$, $t$) is then processed by the multi-modal encoder $f$, which grounds two inputs at a time. The dotted lines feed the cross-attention used for grounding. During training, we add the outputs of the multi-modal encoder to obtain the joint multi-modal embedding $\tilde f(q,d,t)$, which is aligned across three target databases using hard-negative contrastive losses (HN-NCE): $\mathcal{L}_{ve}$, $\mathcal{L}_{mme}$, and $\mathcal{L}_{te}$. During inference, our approach can use either the input query alone or the input query together with its description to retrieve a composed target video.
  • Figure 3: First row: Comparison between the baseline and our approach in terms of the proximity of the output embedding to the target videos on the WebVid dataset. Here, each data sample represents the projection of the embedding from $\mathbb{R}^m$ to $\mathbb{R}^2$. Our joint multi-modal embeddings, which leverage the language information, are closer to the target embeddings than the baseline embedding, which uses only the visual input. Second row: the cosine similarity between video embeddings and the WebVid dataset captions (left), compared to the similarity between video embeddings and our generated textual descriptions (right). Here, the Y-axis corresponds to the number of videos and the X-axis denotes the cosine similarity. Our approach using the generated descriptions achieves better alignment with the video embeddings.
  • Figure 4: Qualitative comparison between the default WebVid-CoVR short captions (top row) and our generated detailed descriptions (bottom row) within our framework. The change text is highlighted in red, and the text (default short captions in the top row, detailed descriptions in the bottom row) is highlighted in black. In all three examples from the WebVid-CoVR test set, we observe that the default WebVid-CoVR short captions struggle to fully preserve the contextual information in the retrieved target video (top row). In comparison, our approach leveraging detailed descriptions correctly retrieves the target video with the most relevant contextual match to the reference video (bottom row). For instance, it keeps the person working while placing him beside a garden in video 1, keeps the sea while replacing the beach with mountains in video 2, and keeps the turboprop airplane at the airport behind a fence in video 3. Best viewed zoomed in. Additional examples are in the supplementary material.
  • Figure 5: Qualitative comparison between Pic2Word [saito2023pic2word] (top row), CoVR-BLIP [ventura2023covr] (middle row), and our proposed method (bottom row) on the zero-shot CoIR task. In all three examples from the CIRR test set, we observe that Pic2Word [saito2023pic2word] and CoVR-BLIP [ventura2023covr], which use only the reference image and the change text (in red), struggle to retrieve the correct target video (top and middle rows). In comparison, our approach leveraging detailed descriptions accurately retrieves the target video with the most relevant contextual match to the reference video (bottom row). Best viewed zoomed in.
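
The Figure 2 caption refers to hard-negative contrastive losses (HN-NCE) but this page does not give their formula. The sketch below shows one common way to weight hard negatives inside an InfoNCE-style loss, in a single query-to-target direction. The function name `hard_negative_nce` and the parameters `tau`, `alpha`, and `beta` are assumptions for illustration, and the weighting scheme may differ from what the paper actually uses.

```python
import numpy as np

def hard_negative_nce(query, target, tau=0.07, alpha=1.0, beta=0.5):
    """One-directional (query -> target) hard-negative contrastive loss.

    Row i of `query` and row i of `target` form the positive pair; every
    other row in the batch is a negative. Needs a batch of at least 2.
    """
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    sim = (q @ t.T) / tau                      # temperature-scaled cosine
    n = sim.shape[0]
    neg_mask = ~np.eye(n, dtype=bool)

    # Negatives that look more like the query receive a larger weight;
    # weights in each row sum to n - 1, so beta = 0 gives plain InfoNCE.
    w = np.where(neg_mask, np.exp(beta * sim), 0.0)
    w = (n - 1) * w / w.sum(axis=1, keepdims=True)

    pos = np.diag(sim)
    denom = alpha * np.exp(pos) + (w * np.exp(sim)).sum(axis=1)
    return float(np.mean(np.log(denom) - pos))

# Toy batch: four orthogonal embeddings (hypothetical values).
x = np.eye(4)
aligned = hard_negative_nce(x, x)          # each query matches its target
shuffled = hard_negative_nce(x, x[::-1])   # each query paired with a wrong target
print(aligned, shuffled)
```

In this toy batch the aligned pairing gives a loss near zero, while the mismatched pairing is penalized heavily because the true match now sits among the up-weighted negatives.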