From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos

Animesh Gupta; Jay Parmar; Ishan Rajendrakumar Dave; Mubarak Shah

TL;DR

TF-CoVR addresses temporally fine-grained composed video retrieval with a sports-focused benchmark that provides multiple ground-truth targets per query, together with a two-stage temporal learning framework, TF-CoVR-Base. The dataset is built from FineGym and FineDiving, covers 306 actions across gymnastics and diving, and contains 180K training triplets, which makes it possible to evaluate subtle motion changes and apparatus-context variations. The method first learns temporally discriminative video representations with an AIM encoder, then aligns each query-modification pair with its target videos through contrastive fusion with text embeddings. On mAP@50 it improves over prior models in both the zero-shot setting (5.92 to 7.51) and the fine-tuned setting (19.83 to 27.22). The work shows that temporal grounding is necessary for CoVR, benchmarks general multimodal embedding (GME) models in this setting, and lays groundwork for real-world applications such as sports highlight generation, with publicly released embeddings for reproducibility.

Abstract

Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving, and provides 180K triplets drawn from the FineGym and FineDiving datasets. Previous CoVR benchmarks that focus on temporal aspects link each query to a single target segment taken from the same video, which limits their practical usefulness. In TF-CoVR, we instead construct each <query, modification> pair by prompting an LLM with the label differences between clips drawn from different videos; every pair is thus associated with multiple valid target videos (3.9 on average), reflecting real-world tasks such as sports-highlight generation. To model these temporal dynamics, we propose TF-CoVR-Base, a concise two-stage training framework: (i) pre-train a video encoder on fine-grained action classification to obtain temporally discriminative embeddings; (ii) align the composed query with candidate videos using contrastive learning. We conduct the first comprehensive study of image, video, and general multimodal embedding (GME) models on temporally fine-grained composed retrieval in both zero-shot and fine-tuning regimes. On TF-CoVR, TF-CoVR-Base improves zero-shot mAP@50 from 5.92 (LanguageBind) to 7.51, and after fine-tuning raises the state of the art from 19.83 to 27.22.
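
Because each query in TF-CoVR has several valid targets, a ranking metric such as mAP@50 has to score every relevant video in the top of the list rather than a single hit. The following is a minimal sketch of one common way to compute mAP@K under that setting. The function names, the normalization by `min(len(relevant), k)`, and the toy video IDs are assumptions for illustration; the abstract does not state the exact evaluation protocol, so this should not be read as the authors' implementation.

```python
from typing import Hashable, Sequence, Set


def average_precision_at_k(
    ranked: Sequence[Hashable], relevant: Set[Hashable], k: int
) -> float:
    """AP@K for one query with possibly many relevant targets.

    `ranked` is the retrieved list, best first, assumed free of duplicates.
    Normalizing by min(len(relevant), k) is one common convention (an
    assumption here); other papers divide by len(relevant) instead.
    """
    if not relevant or k <= 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            # Precision measured at the rank of each relevant hit.
            precision_sum += hits / rank
    return precision_sum / min(len(relevant), k)


def mean_average_precision_at_k(
    rankings: Sequence[Sequence[Hashable]],
    relevants: Sequence[Set[Hashable]],
    k: int = 50,
) -> float:
    """Mean of AP@K over all queries; returns 0.0 for an empty query set."""
    scores = [
        average_precision_at_k(ranked, relevant, k)
        for ranked, relevant in zip(rankings, relevants)
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example with hypothetical video IDs: the first query has two valid
# targets retrieved at ranks 1 and 3; the second query's target is missed.
rankings = [["v1", "v2", "v3", "v4"], ["v5", "v6", "v7", "v8"]]
relevants = [{"v1", "v3"}, {"v9"}]
score = mean_average_precision_at_k(rankings, relevants, k=50)
```

In the toy example, the first query contributes (1/1 + 2/3) / 2 and the second contributes 0, so the mean rewards retrieving all valid targets early instead of stopping at the first match.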
