Video Editing for Video Retrieval

Bin Zhu, Kevin Flanagan, Adriano Fragomeni, Michael Wray, Dima Damen

TL;DR

This work tackles the high cost of collecting precise video clip boundaries for text-to-video retrieval by leveraging single timestamps as weak supervision. It introduces a two-stage approach: a warm-up phase that trains a cross-modal retriever from rough, timestamp-derived clips, and a subsequent teacher-student co-training framework that edits clip boundaries and learns from edited clips to improve retrieval accuracy. The method is model-agnostic and validated across three retrieval architectures (COOT, VideoCLIP, CLIP4Clip) and three datasets (YouCook2, DiDeMo, ActivityNet-Captions), yielding consistent gains over timestamp baselines and approaching upper-bound performance with ground-truth boundaries. Human studies corroborate that edited clips better reflect caption content, underscoring practical potential for scalable video retrieval with weak supervision. Overall, the paper offers a feasible path to improve clip retrieval without exhaustive manual annotation by enabling iterative boundary refinement through mutual learning between editing and retrieval.

Abstract

Though pre-training vision-language models on large-scale web videos has demonstrated significant benefits in boosting video-text retrieval performance, fine-tuning on manually annotated clips with start and end times still plays a critical role, and collecting these annotations requires considerable human effort. To address this issue, we explore an alternative, cheaper source of annotations, single timestamps, for video-text retrieval. We initialise clips from timestamps in a heuristic way to warm up a retrieval model. Then a video clip editing method is proposed to refine the initial rough boundaries to improve retrieval performance. A student-teacher network is introduced for video clip editing. The teacher model is employed to edit the clips in the training set, whereas the student model trains on the edited clips. The teacher's weights are updated from the student's after the student's performance increases. Our method is model-agnostic and applicable to any retrieval model. We conduct experiments based on three state-of-the-art retrieval models: COOT, VideoCLIP and CLIP4Clip. Experiments conducted on three video retrieval datasets, YouCook2, DiDeMo and ActivityNet-Captions, show that our edited clips consistently improve retrieval performance over initial clips across all three retrieval models.
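
The teacher-student loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name co_train, the callables edit_fn, train_fn and eval_fn, and the fixed number of rounds are all hypothetical, and the sketch assumes the teacher always re-edits the initial clips (the abstract does not say whether edits accumulate).

```python
import copy
from typing import Any, Callable, List, Tuple

Clip = Tuple[float, float]  # (start, end) in seconds


def co_train(
    student: Any,
    teacher: Any,
    clips: List[Clip],
    edit_fn: Callable[[Any, Clip], Clip],      # hypothetical: teacher edits one clip
    train_fn: Callable[[Any, List[Clip]], None],  # hypothetical: trains student in place
    eval_fn: Callable[[Any], float],           # hypothetical: retrieval score, higher is better
    rounds: int = 3,
):
    """Alternate between teacher-side clip editing and student-side training."""
    best_score = eval_fn(student)
    for _ in range(rounds):
        # The teacher proposes new boundaries for every training clip.
        # Assumption: editing always starts from the initial clips.
        edited = [edit_fn(teacher, clip) for clip in clips]
        # The student trains on the edited clips.
        train_fn(student, edited)
        score = eval_fn(student)
        # The teacher takes the student's weights only after the student improves.
        if score > best_score:
            best_score = score
            teacher = copy.deepcopy(student)
    return student, teacher, best_score


# Toy usage: "models" are dicts holding a single number.
deltas = iter([1.0, -2.0])
toy_student, toy_teacher, toy_best = co_train(
    student={"w": 0.0},
    teacher={"w": 0.0},
    clips=[(0.0, 4.0)],
    edit_fn=lambda model, clip: clip,
    train_fn=lambda model, edited: model.update(w=model["w"] + next(deltas)),
    eval_fn=lambda model: model["w"],
    rounds=2,
)
```

In the toy run the student improves in the first round, so the teacher copies it, and gets worse in the second round, so the teacher is left unchanged.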

Paper Structure (13 sections, 4 equations, 8 figures, 3 tables)

This paper contains 13 sections, 4 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Given videos with rough timestamp supervision, we propose a method to edit clips to increase video retrieval performance.
  • Figure 2: Overview of the method. Left: we warm up a text-video retrieval model from initial clips $v_i^0$ in the first stage; $P_v$ and $P_c$ refer to the clip and caption encoders respectively. Centre: a teacher edits clips to have maximum similarity with their corresponding captions. A student model learns on the edited clips, and once its performance increases, the teacher's weights are updated with the student's. Right: we edit a video by selecting the top $k$ segments most similar to the corresponding caption. Candidates are created using these as start/end times, and the clip with the highest average IoU is chosen as the new clip.
  • Figure 3: The percentage of clip editing in the training set across three datasets using COOT. We present histograms showing the IoU between initial and edited clips (a) and between ground truth and edited clips (b).
  • Figure 4: Results from a human study evaluating the correspondence between initial and edited clips to their captions across three video retrieval datasets. The percentage of choices favoring the better clips is reported.
  • Figure 5: Qualitative examples showing caption-to-clip retrieval with initial (orange) and edited (blue) clips from different videos, where the timestamps are in seconds. The first four examples show edited clips are superior to initial clips from human perception, and the last two demonstrate the edited clips are inferior to initial clips. Video clips are presented on the project webpage.
  • ...and 3 more figures
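
The clip-editing step in the Figure 2 caption (pick the top $k$ segments, build candidate clips from them, keep the candidate with the highest average IoU) can be sketched as below. This is a hedged illustration only: the names temporal_iou and edit_clip, the fixed segment length seg_len, and in particular the choice to average each candidate's IoU over all other candidates are assumptions, since the caption does not state what the IoU is averaged against.

```python
from itertools import combinations_with_replacement
from typing import List, Sequence, Tuple

Clip = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(a: Clip, b: Clip) -> float:
    """Intersection over union of two temporal intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def edit_clip(similarities: Sequence[float], k: int, seg_len: float = 1.0) -> Clip:
    """Choose new clip boundaries from per-segment caption similarities.

    similarities[i] is the (hypothetical) teacher similarity between the
    caption and the i-th fixed-length segment of the video.
    """
    # Indices of the k segments most similar to the caption, in temporal order.
    ranked = sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)
    top = sorted(ranked[:k])
    # Each candidate starts at one top segment and ends at the same or a later one.
    candidates: List[Clip] = [
        (i * seg_len, (j + 1) * seg_len)
        for i, j in combinations_with_replacement(top, 2)
    ]

    def avg_iou(idx: int) -> float:
        # Assumption: average IoU is taken against all other candidates.
        others = [c for n, c in enumerate(candidates) if n != idx]
        if not others:
            return 0.0
        return sum(temporal_iou(candidates[idx], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=avg_iou)
    return candidates[best]


# Toy usage with five one-second segments.
new_clip = edit_clip([0.1, 0.9, 0.8, 0.2, 0.7], k=3)
```

With these toy similarities the top segments are 1, 2 and 4, and the candidate spanning all of them has the highest average overlap with the others. A different averaging target would generally select a different clip.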