EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval

Thomas Hummel, Shyamgopal Karthik, Mariana-Iuliana Georgescu, Zeynep Akata

TL;DR

Composed Video Retrieval (CVR) requires retrieving a target video given a reference video and a textual modification. The task demands strong temporal understanding, which existing benchmarks often fail to test. EgoCVR is a manually curated egocentric benchmark with 2,295 queries and 9k clips that stresses fine-grained temporal changes, and prior CVR models struggle in this setting. The authors propose a training-free approach, TF-CVR, augmented by a two-stage re-ranking (TFR-CVR) that first narrows the gallery by visual similarity and then re-ranks with a text-based target caption generated via an LLM, achieving strong results on EgoCVR. The work highlights the importance of temporal information, introduces a broadly applicable re-ranking strategy, and offers a high-quality benchmark for evaluating temporal action understanding in CVR, with potential implications for real-world video search and interactive retrieval systems.
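The two-stage re-ranking described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes every clip already has a precomputed visual embedding and a text-aligned embedding, and that the LLM-generated target caption has already been embedded. The function names, the parameter `shortlist_size`, and the toy vectors are all hypothetical, and the caption-generation step itself is omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def two_stage_rerank(ref_visual, caption_emb, gallery_visual, gallery_text_aligned,
                     shortlist_size):
    """Return gallery indices, best match first.

    Stage 1 keeps the `shortlist_size` clips most visually similar to the
    reference video. Stage 2 reorders only those clips by similarity to the
    embedded target caption. All inputs are assumed to be precomputed.
    """
    # Stage 1: rank the whole gallery by visual similarity to the reference video.
    by_visual = sorted(
        range(len(gallery_visual)),
        key=lambda i: cosine(ref_visual, gallery_visual[i]),
        reverse=True,
    )
    shortlist = by_visual[:shortlist_size]

    # Stage 2: re-rank the shortlist against the target caption embedding.
    return sorted(
        shortlist,
        key=lambda i: cosine(caption_emb, gallery_text_aligned[i]),
        reverse=True,
    )


# Toy gallery of four clips with made-up 2-D embeddings.
gallery_visual = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
gallery_text_aligned = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 1.0]]
ref_visual = [1.0, 0.0]    # embedding of the reference video
caption_emb = [0.0, 1.0]   # embedding of the generated target caption

ranking = two_stage_rerank(ref_visual, caption_emb, gallery_visual,
                           gallery_text_aligned, shortlist_size=3)
print(ranking)
```

In this toy gallery, clip 3 matches the caption as well as clip 1 does, but it is visually unlike the reference and so never reaches the second stage. That is the point of filtering first: the text stage only has to discriminate among clips that already resemble the reference video.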

Abstract

In Composed Video Retrieval, a video and a textual description which modifies the video content are provided as inputs to the model. The aim is to retrieve the relevant video with the modified content from a database of videos. In this challenging task, the first step is to acquire large-scale training datasets and collect high-quality benchmarks for evaluation. In this work, we introduce EgoCVR, a new evaluation benchmark for fine-grained Composed Video Retrieval using large-scale egocentric video datasets. EgoCVR consists of 2,295 queries that specifically focus on high-quality temporal video understanding. We find that existing Composed Video Retrieval frameworks do not achieve the necessary high-quality temporal video understanding for this task. To address this shortcoming, we adapt a simple training-free method, propose a generic re-ranking framework for Composed Video Retrieval, and demonstrate that this achieves strong results on EgoCVR. Our code and benchmark are freely available at https://github.com/ExplainableML/EgoCVR.

Paper Structure

This paper contains 19 sections, 2 equations, 11 figures, and 6 tables.

Figures (11)

  • Figure 1: The goal of the Composed Video Retrieval (CVR) task is to retrieve the correct video using both a query video and a textual video modification instruction that describes the semantic changes required from the query video.
  • Figure 1: Qualitative depiction of failure cases of TFR-CVR. The modular approach of TFR-CVR allows us to trace back failure cases mostly to two main sources: video captioning errors (top) and text-to-video retrieval errors (bottom).
  • Figure 2: EgoCVR focuses to a significantly greater extent on temporal and action-related modifications (blue) as opposed to object-centred modifications (orange) when compared to the previously existing WebVid-CoVR-Test benchmark (Ventura et al., 2023).
  • Figure 2: Diversity of actions and environments in EgoCVR.
  • Figure 3: Samples consisting of visual and text queries along with the target video from our test set EgoCVR (top two rows) and the WebVid-CoVR-Test set (Ventura et al., 2023) (bottom row).
  • ...and 6 more figures