From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos

Animesh Gupta, Jay Parmar, Ishan Rajendrakumar Dave, Mubarak Shah

TL;DR

TF-CoVR tackles temporally fine-grained composed video retrieval by introducing a sports-focused benchmark with multiple ground-truth targets per query and a two-stage temporal learning framework (TF-CoVR-Base). The dataset is built from FineGym and FineDiving, featuring 306 actions across gymnastics and diving and 180K training triplets, enabling evaluation of subtle motion changes and apparatus-context variations. The method first learns temporally discriminative video representations via an AIM encoder and then aligns query-modification pairs to target videos through contrastive fusion with text embeddings, improving mAP@50 from 5.92 to 7.51 in the zero-shot setting and from 19.83 to 27.22 after fine-tuning. This work demonstrates the necessity of temporal grounding for CoVR, benchmarks general multimodal embedding (GME) models in this setting, and lays groundwork for real-world applications like sports highlight generation, with publicly released embeddings for reproducibility.
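
Because each query has several valid targets, a ranking metric such as mAP@50 has to credit every relevant video, not just the first hit. The following is a minimal sketch of one common way to compute mAP@K with multiple ground truths; the function names and the normalization by `min(k, number of relevant videos)` are assumptions made for illustration and may differ from the paper's evaluation code.

```python
from typing import Sequence, Set


def average_precision_at_k(ranked_ids: Sequence[str],
                           relevant: Set[str],
                           k: int = 50) -> float:
    """AP@k for one query that may have several valid target videos."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    # Walk the top-k ranked list and accumulate precision at each hit.
    for rank, video_id in enumerate(ranked_ids[:k], start=1):
        if video_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Assumed normalization: best achievable number of hits within k.
    return precision_sum / min(k, len(relevant))


def mean_average_precision_at_k(rankings: Sequence[Sequence[str]],
                                relevants: Sequence[Set[str]],
                                k: int = 50) -> float:
    """Mean of AP@k over all queries."""
    scores = [average_precision_at_k(r, g, k)
              for r, g in zip(rankings, relevants)]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy query with two valid targets, found at ranks 1 and 3.
    ranking = ["a", "x", "b", "y"]
    targets = {"a", "b"}
    print(average_precision_at_k(ranking, targets, k=50))
```

In the toy query the two valid targets sit at ranks 1 and 3, so the precision terms are 1/1 and 2/3, and their average is 5/6.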

Abstract

Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving, and provides 180K triplets drawn from the FineGym and FineDiving datasets. Previous CoVR benchmarks that focus on the temporal aspect link each query to a single target segment taken from the same video, limiting practical usefulness. In TF-CoVR, we instead construct each <query, modification> pair by prompting an LLM with the label differences between clips drawn from different videos; every pair is thus associated with multiple valid target videos (3.9 on average), reflecting real-world tasks such as sports-highlight generation. To model these temporal dynamics, we propose TF-CoVR-Base, a concise two-stage training framework: (i) pre-train a video encoder on fine-grained action classification to obtain temporally discriminative embeddings; (ii) align the composed query with candidate videos using contrastive learning. We conduct the first comprehensive study of image, video, and general multimodal embedding (GME) models on temporally fine-grained composed retrieval in both zero-shot and fine-tuning regimes. On TF-CoVR, TF-CoVR-Base improves zero-shot mAP@50 from 5.92 (LanguageBind) to 7.51, and after fine-tuning raises the state-of-the-art from 19.83 to 27.22.
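
The second stage, aligning a composed query with candidate videos through contrastive learning, can be illustrated with an in-batch InfoNCE-style loss. The sketch below rests on several assumptions: the `fuse` function uses a simple normalized sum as a stand-in for the learned fusion, the temperature of 0.07 is a conventional default rather than a reported value, and each query is given a single positive target for simplicity. It is not the authors' implementation.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def fuse(query_video: Vector, modification_text: Vector) -> Vector:
    """Hypothetical fusion: normalized element-wise sum of the two embeddings.

    A real system would use learned layers here; this is only a placeholder.
    """
    return l2_normalize([a + b for a, b in zip(query_video, modification_text)])


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(composed: List[Vector],
                  targets: List[Vector],
                  temperature: float = 0.07) -> float:
    """In-batch contrastive loss: targets[i] is the positive for composed[i],
    and every other target in the batch acts as a negative."""
    total = 0.0
    for i, query in enumerate(composed):
        logits = [dot(query, t) / temperature for t in targets]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(composed)


if __name__ == "__main__":
    queries = [fuse([1.0, 0.0], [0.0, 0.0]), fuse([0.0, 1.0], [0.0, 0.0])]
    aligned = [l2_normalize([1.0, 0.0]), l2_normalize([0.0, 1.0])]
    swapped = list(reversed(aligned))
    print(info_nce_loss(queries, aligned), info_nce_loss(queries, swapped))
```

The loss is near zero when each composed query points at its own target and grows when targets are swapped, which is the behavior a contrastive alignment objective is meant to enforce.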

Paper Structure

This paper contains 22 sections, 11 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Comparison of composed-retrieval triplets in WebVid-CoVR, Ego-CVR, and TF-CoVR. (a) WebVid-CoVR targets appearance changes. (b) Ego-CVR selects the target clip from a different time-stamp of the same video, showing a new interaction with the same object. (c) TF-CoVR supports two fine-grained modification types: temporal change, which varies sub-actions within the same event (row 3), and event change, in which the same sub-action is performed on different apparatuses (row 4).
  • Figure 2: Overview of our automatic triplet generation pipeline for TF-CoVR. We start with temporally labeled clips from FineGym and FineDiving datasets. Using CLIP-based text embeddings, we compute similarity between temporal labels and form pairs with high semantic similarity. These label pairs are passed to GPT-4o along with in-context examples to generate natural language modifications describing the temporal differences between them. Each generated triplet consists of a query video, a target video, and a modification text capturing fine-grained temporal action changes.
  • Figure 3: Overview of TF-CoVR-Base framework. Stage 1 learns temporal video representations via supervised classification using the AIM encoder. In Stage 2, the pretrained AIM and BLIP encoders are frozen, and a projection layer and MLP are trained to align the query-modification pair with the target video using contrastive loss. During inference, the model retrieves relevant videos from TF-CoVR based on a user-provided query video and textual modification.
  • Figure 4: Qualitative results for the composed video retrieval task using our two-stage TF-CoVR-Base model. Each column showcases a query video (top), the corresponding modification instruction (middle), and the top-3 retrieved target videos (ranks 1–3) based on model predictions. TF-CoVR-Base effectively captures subtle temporal variations and retrieves the correct target video at higher ranks. In contrast, the baseline method BLIPCoVR-ECDE often fails to identify the correct action class or resolve fine-grained temporal differences, as indicated by the errors highlighted in red.
  • Figure A1: Label-wise video count distribution in the FineGym subset of TF-CoVR. A logarithmic scale is used on the y-axis to highlight the steep drop in video counts per label due to the smaller dataset size. Note that only a subset of all labels is shown for clarity.
  • ...and 5 more figures
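
The label-pairing step described in the Figure 2 caption (embed the temporal labels, then keep pairs with high semantic similarity) can be sketched as follows. The toy two-dimensional embeddings, the label strings, and the 0.9 threshold are all hypothetical; the actual pipeline uses CLIP-based text embeddings, and its similarity cutoff is not stated in this excerpt.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def pair_similar_labels(label_embeddings: Dict[str, List[float]],
                        threshold: float = 0.9) -> List[Tuple[str, str, float]]:
    """Return label pairs whose embedding similarity meets the threshold,
    sorted from most to least similar."""
    pairs = []
    for (label_a, emb_a), (label_b, emb_b) in combinations(label_embeddings.items(), 2):
        score = cosine(emb_a, emb_b)
        if score >= threshold:
            pairs.append((label_a, label_b, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


if __name__ == "__main__":
    # Hypothetical embeddings standing in for text-encoder outputs.
    toy = {
        "forward dive, 1.5 somersaults": [1.0, 0.1],
        "forward dive, 2.5 somersaults": [1.0, 0.2],
        "balance beam: split leap": [0.0, 1.0],
    }
    for label_a, label_b, score in pair_similar_labels(toy):
        print(f"{label_a} <-> {label_b}: {score:.3f}")
```

Each surviving pair would then be handed to the language model, together with in-context examples, to produce the modification text; that generation step is outside the scope of this sketch.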