Vid-Morp: Video Moment Retrieval Pretraining from Unlabeled Videos in the Wild

Peijun Bao, Chenqi Kong, Zihao Shao, Boon Poh Ng, Meng Hwa Er, Alex C. Kot

TL;DR

This work tackles the heavy reliance on manual annotations in video moment retrieval (VMR) by introducing Vid-Morp, a large-scale in-the-wild dataset with 50K+ videos and 200K pseudo annotations generated via GPT-4o. It proposes ReCorrect, a two-phase pretraining framework: semantics-guided refinement to filter and adjust pseudo labels, and memory-consensus correction to calibrate temporal boundaries using a memory bank. Across zero-shot, unsupervised, and fully supervised settings, ReCorrect demonstrates strong generalization, achieving about 75–80% of fully supervised performance in the zero-shot setting and around 85% in the unsupervised setting on Charades-STA and ActivityNet Captions, with robust out-of-distribution performance. The authors provide code, data, and pretrained models to enable broader adoption and further reduce annotation costs for VMR.

Abstract

Given a natural language query, video moment retrieval aims to localize the described temporal moment in an untrimmed video. A major challenge of this task is its heavy dependence on labor-intensive annotations for training. Unlike existing works that directly train models on manually curated data, we propose a novel paradigm to reduce annotation costs: pretraining the model on unlabeled, real-world videos. To support this, we introduce Video Moment Retrieval Pretraining (Vid-Morp), a large-scale dataset collected with minimal human intervention, consisting of over 50K videos captured in the wild and 200K pseudo annotations. Direct pretraining on these imperfect pseudo annotations, however, presents significant challenges, including mismatched sentence-video pairs and imprecise temporal boundaries. To address these issues, we propose the ReCorrect algorithm, which comprises two main phases: semantics-guided refinement and memory-consensus correction. The semantics-guided refinement enhances the pseudo labels by leveraging semantic similarity with video frames to clean out unpaired data and make initial adjustments to temporal boundaries. In the following memory-consensus correction phase, a memory bank tracks the model predictions, progressively correcting the temporal boundaries based on consensus within the memory. Comprehensive experiments demonstrate ReCorrect's strong generalization abilities across multiple downstream settings. Zero-shot ReCorrect achieves over 75% and 80% of the best fully-supervised performance on two benchmarks, while unsupervised ReCorrect reaches about 85% on both. The code, dataset, and pretrained models are available at https://github.com/baopj/Vid-Morp.
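The semantics-guided refinement phase described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes each pseudo sample already comes with per-frame query-to-frame similarity scores (for example from a pretrained vision-language encoder), and the function name `refine_pseudo_label` and the parameters `keep_threshold` and `span_ratio` are hypothetical choices made only for illustration.

```python
from typing import List, Optional, Tuple


def refine_pseudo_label(
    frame_sims: List[float],
    keep_threshold: float = 0.25,  # hypothetical: below this peak the pair counts as unmatched
    span_ratio: float = 0.8,       # hypothetical: frames within this fraction of the peak stay in the span
) -> Optional[Tuple[int, int]]:
    """Return a refined (start, end) frame span, or None if the sample is dropped.

    frame_sims[i] is an assumed similarity score between the query and frame i.
    """
    if not frame_sims:
        return None

    peak = max(frame_sims)
    if peak < keep_threshold:
        # No frame resembles the query: treat the pair as unpaired data.
        return None

    peak_idx = frame_sims.index(peak)
    cutoff = peak * span_ratio

    # Grow the span outward from the peak while neighbours stay above the cutoff.
    start = peak_idx
    while start > 0 and frame_sims[start - 1] >= cutoff:
        start -= 1
    end = peak_idx
    while end < len(frame_sims) - 1 and frame_sims[end + 1] >= cutoff:
        end += 1
    return start, end
```

Under these assumptions, a sample whose best frame never reaches `keep_threshold` is filtered out, while a matched sample gets an initial boundary around its most query-relevant frames. The correction phase would then refine that boundary further.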

Paper Structure

This paper contains 25 sections, 9 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: A crucial challenge in video moment retrieval is its heavy reliance on extensive manual annotations for training. To overcome this, we introduce a large-scale dataset for Video Moment Retrieval Pretraining (Vid-Morp), collected with minimal human involvement. Vid-Morp comprises over 50K in-the-wild videos and 200K pseudo training samples. Models pretrained on Vid-Morp significantly reduce annotation costs and demonstrate strong generalizability across diverse downstream settings.
  • Figure 2: Illustration of video samples and pseudo-annotations, including sentence queries and temporal boundaries, from the Video Moment Retrieval Pretraining (Vid-Morp) dataset. The dark blue box represents the temporal boundary of the described video moment.
  • Figure 3: Overview of the Refinement and Correction (ReCorrect) algorithm for video moment retrieval pretraining from in-the-wild videos. ReCorrect consists of two key phases: 1) semantics-guided refinement, which leverages semantic similarity to clean noisy pseudo training samples, such as idle videos and unmatched video-query pairs, while initially adjusting temporal boundaries, and 2) memory-consensus correction, where a memory bank tracks model predictions, progressively correcting temporal boundaries based on consensus within the memory.
  • Figure 4: Collected in a scalable, labor-free manner, the Vid-Morp dataset exhibits three common errors in pseudo training samples: 1) idle videos lacking meaningful activity, 2) unmatched video-query pairs where the query event does not appear in the video, and 3) imprecise temporal boundaries where video-query matches are correct but temporal boundaries are inaccurate.
  • Figure 5: Scalability of pretraining dataset size.
  • ...and 2 more figures
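
The memory-consensus correction phase named in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: the class `MemoryBank`, its parameters `size` and `min_entries`, and the use of a per-boundary median as the "consensus" are all hypothetical choices. The source states only that a memory bank tracks model predictions and corrects temporal boundaries based on consensus within the memory.

```python
from collections import defaultdict, deque
from statistics import median
from typing import Deque, Dict, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


class MemoryBank:
    """Keeps the most recent predicted spans per training sample (hypothetical design)."""

    def __init__(self, size: int = 5, min_entries: int = 3) -> None:
        self.min_entries = min_entries
        self._bank: Dict[str, Deque[Span]] = defaultdict(lambda: deque(maxlen=size))

    def record(self, sample_id: str, prediction: Span) -> None:
        # Store the model's latest prediction; the oldest entry falls out when full.
        self._bank[sample_id].append(prediction)

    def corrected(self, sample_id: str, current: Span) -> Span:
        """Return the consensus span, or the current label if the memory is too sparse."""
        history = self._bank.get(sample_id)
        if history is None or len(history) < self.min_entries:
            return current
        # Assumed consensus rule: median of stored starts and of stored ends.
        start = median(s for s, _ in history)
        end = median(e for _, e in history)
        return (start, end) if start < end else current
```

In a training loop built on this sketch, each step would call `record` with the model's prediction and then use `corrected` to obtain the target boundary for the next pass, so that labels drift toward spans the model predicts consistently.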