EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval
Thomas Hummel, Shyamgopal Karthik, Mariana-Iuliana Georgescu, Zeynep Akata
TL;DR
Composed Video Retrieval (CVR) retrieves a target video given a reference video and a textual modification. The task demands strong temporal understanding, which existing benchmarks often fail to probe. EgoCVR is a manually curated egocentric benchmark with 2,295 queries and 9k clips that stresses fine-grained temporal changes, and prior CVR models struggle on it. The authors propose a training-free approach, TF-CVR, and augment it with a two-stage re-ranking (TFR-CVR) that first narrows the gallery by visual similarity and then re-ranks the shortlist against a text-based target caption generated by an LLM; this achieves strong results on EgoCVR. The work highlights the importance of temporal information, introduces a broadly applicable re-ranking strategy, and offers a high-quality benchmark for evaluating temporally-aware video retrieval in egocentric settings, with potential implications for real-world video search and interactive retrieval systems.
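The two-stage re-ranking idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes all embeddings are already computed by some video and text encoders, and every name in it (`two_stage_rerank`, `gallery_text_space_embs`, `top_k`) is hypothetical. In particular, how gallery clips are represented for comparison with the target caption is an assumption here; the sketch simply takes one vector per clip in a text-comparable space.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def two_stage_rerank(ref_video_emb, gallery_video_embs, target_caption_emb,
                     gallery_text_space_embs, top_k=3):
    """Return gallery indices, best first, after a two-stage ranking.

    All inputs are assumed to be precomputed embeddings (hypothetical setup):
      ref_video_emb:           (d_v,)   reference video
      gallery_video_embs:      (n, d_v) gallery clips, visual space
      target_caption_emb:      (d_t,)   LLM-generated target caption
      gallery_text_space_embs: (n, d_t) gallery clips, text-comparable space
    """
    # Stage 1: shortlist the top_k gallery clips most visually similar
    # to the reference video.
    visual_scores = l2_normalize(gallery_video_embs) @ l2_normalize(ref_video_emb)
    k = min(top_k, len(visual_scores))
    shortlist = np.argsort(-visual_scores, kind="stable")[:k]

    # Stage 2: re-rank only the shortlist by similarity to the target caption.
    text_scores = (l2_normalize(gallery_text_space_embs[shortlist])
                   @ l2_normalize(target_caption_emb))
    order = np.argsort(-text_scores, kind="stable")
    return shortlist[order].tolist()


# Toy example with 2-D embeddings (values are made up for illustration).
ref = np.array([1.0, 0.0])
gallery_visual = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]])
caption = np.array([0.0, 1.0])
gallery_text = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]])

print(two_stage_rerank(ref, gallery_visual, caption, gallery_text, top_k=3))
# prints [1, 3, 0]
```

In the toy example, clip 2 matches the caption perfectly but is visually unlike the reference, so stage 1 filters it out before the text-based re-ranking ever sees it. That is the intended effect of narrowing the gallery first.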
Abstract
In Composed Video Retrieval, a video and a textual description which modifies the video content are provided as inputs to the model. The aim is to retrieve the relevant video with the modified content from a database of videos. In this challenging task, the first step is to acquire large-scale training datasets and collect high-quality benchmarks for evaluation. In this work, we introduce EgoCVR, a new evaluation benchmark for fine-grained Composed Video Retrieval using large-scale egocentric video datasets. EgoCVR consists of 2,295 queries that specifically focus on high-quality temporal video understanding. We find that existing Composed Video Retrieval frameworks do not achieve the necessary high-quality temporal video understanding for this task. To address this shortcoming, we adapt a simple training-free method, propose a generic re-ranking framework for Composed Video Retrieval, and demonstrate that this achieves strong results on EgoCVR. Our code and benchmark are freely available at https://github.com/ExplainableML/EgoCVR.
