Vid-Morp: Video Moment Retrieval Pretraining from Unlabeled Videos in the Wild

Peijun Bao; Chenqi Kong; Zihao Shao; Boon Poh Ng; Meng Hwa Er; Alex C. Kot

TL;DR

This work tackles the heavy reliance on manual annotations in video moment retrieval (VMR) by introducing Vid-Morp, a large-scale in-the-wild dataset with over 50K videos and 200K pseudo annotations generated via GPT-4o. It proposes ReCorrect, a two-phase pretraining framework: semantics-guided refinement filters out mismatched sentence-video pairs and makes initial adjustments to the pseudo labels, and memory-consensus correction calibrates temporal boundaries using a memory bank. ReCorrect generalizes well across zero-shot, unsupervised, and fully supervised settings. On Charades-STA and ActivityNet Captions, zero-shot ReCorrect reaches over 75% and 80% of the best fully supervised performance and unsupervised ReCorrect reaches about 85% on both, with robust out-of-distribution performance. The authors release code, data, and pretrained models to enable broader adoption and further reduce annotation costs for VMR.

Abstract

Given a natural language query, video moment retrieval aims to localize the described temporal moment in an untrimmed video. A major challenge of this task is its heavy dependence on labor-intensive annotations for training. Unlike existing works that directly train models on manually curated data, we propose a novel paradigm to reduce annotation costs: pretraining the model on unlabeled, real-world videos. To support this, we introduce Video Moment Retrieval Pretraining (Vid-Morp), a large-scale dataset collected with minimal human intervention, consisting of over 50K videos captured in the wild and 200K pseudo annotations. Direct pretraining on these imperfect pseudo annotations, however, presents significant challenges, including mismatched sentence-video pairs and imprecise temporal boundaries. To address these issues, we propose the ReCorrect algorithm, which comprises two main phases: semantics-guided refinement and memory-consensus correction. The semantics-guided refinement enhances the pseudo labels by leveraging semantic similarity with video frames to clean out unpaired data and make initial adjustments to temporal boundaries. In the following memory-consensus correction phase, a memory bank tracks the model predictions, progressively correcting the temporal boundaries based on consensus within the memory. Comprehensive experiments demonstrate ReCorrect's strong generalization abilities across multiple downstream settings. Zero-shot ReCorrect achieves over 75% and 80% of the best fully-supervised performance on two benchmarks, while unsupervised ReCorrect reaches about 85% on both. The code, dataset, and pretrained models are available at https://github.com/baopj/Vid-Morp.
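
To make the two phases described above more concrete, the following is a minimal illustrative sketch and not the authors' implementation. Everything in it is an assumption made for illustration: the function and class names (refine_with_semantics, ConsensusMemory), the thresholds, the use of per-frame sentence similarity scores as input, and the choice of temporal IoU as the agreement measure for consensus. The abstract does not specify any of these details.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple

Span = Tuple[float, float]  # (start, end), e.g. in seconds


def temporal_iou(a: Span, b: Span) -> float:
    """Intersection over union of two temporal spans."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def refine_with_semantics(
    sims: List[float],
    span: Tuple[int, int],
    keep_thresh: float = 0.25,   # hypothetical: minimum peak similarity to keep a pair
    frame_ratio: float = 0.5,    # hypothetical: fraction of the peak a frame must reach
) -> Optional[Tuple[int, int]]:
    """Assumed phase 1: drop unpaired data, then tighten the pseudo span.

    sims[i] is the similarity between frame i and the sentence (how it is
    computed is left open here). span is an inclusive (first, last) frame range.
    """
    first, last = span
    window = sims[first:last + 1]
    peak = max(window)
    if peak < keep_thresh:
        return None  # no frame matches the sentence: treat the pair as mismatched
    kept = [first + i for i, s in enumerate(window) if s >= frame_ratio * peak]
    return kept[0], kept[-1]


class ConsensusMemory:
    """Assumed phase 2: remember recent predictions per sample and return
    the one that agrees most (by mean temporal IoU) with the others."""

    def __init__(self, size: int = 5) -> None:
        self.size = size
        self.bank: Dict[str, Deque[Span]] = {}

    def update(self, key: str, pred: Span) -> None:
        # deque(maxlen=...) discards the oldest prediction once the bank is full
        self.bank.setdefault(key, deque(maxlen=self.size)).append(pred)

    def consensus(self, key: str) -> Optional[Span]:
        preds = list(self.bank.get(key, ()))
        if len(preds) < 2:
            return None  # not enough history to form a consensus

        def support(i: int) -> float:
            others = [p for j, p in enumerate(preds) if j != i]
            return sum(temporal_iou(preds[i], p) for p in others) / len(others)

        return preds[max(range(len(preds)), key=support)]
```

In this sketch, a training loop would call update after each epoch with the model's predicted span for a sample and use consensus as the corrected pseudo boundary for later epochs. Picking the prediction with the highest mean agreement is only one of several plausible consensus rules; averaging the agreeing spans would be another.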
