Grounding-MD: Grounded Video-language Pre-training for Open-World Moment Detection

Weijun Zhuang, Qizhang Li, Xin Li, Ming Liu, Xiaopeng Hong, Feng Gao, Fan Yang, Wangmeng Zuo

TL;DR

Grounding-MD tackles open-world moment detection by unifying Temporal Action Detection (TAD) and Moment Retrieval (MR) under a structured prompt framework. It introduces a Cross-Modality Fusion Encoder and a Text-Guided Fusion Decoder to achieve deep video-text alignment, supplemented by a Query-Wise Pooler that stabilizes cross-task training. Pre-training on large-scale TAD and MR data enables strong zero-shot and supervised performance on ActivityNet, THUMOS14, ActivityNet-Captions, and Charades-STA. The authors report state-of-the-art results with favorable efficiency compared to Video-LLM baselines.

Abstract

Temporal Action Detection and Moment Retrieval constitute two pivotal tasks in video understanding, focusing on precisely localizing temporal segments corresponding to specific actions or events. Recent advancements introduced Moment Detection to unify these two tasks, yet existing approaches remain confined to closed-set scenarios, limiting their applicability in open-world contexts. To bridge this gap, we present Grounding-MD, an innovative, grounded video-language pre-training framework tailored for open-world moment detection. Our framework incorporates an arbitrary number of open-ended natural language queries through a structured prompt mechanism, enabling flexible and scalable moment detection. Grounding-MD leverages a Cross-Modality Fusion Encoder and a Text-Guided Fusion Decoder to facilitate comprehensive video-text alignment and enable effective cross-task collaboration. Through large-scale pre-training on temporal action detection and moment retrieval datasets, Grounding-MD demonstrates exceptional semantic representation learning capabilities, effectively handling diverse and complex query conditions. Comprehensive evaluations across four benchmark datasets including ActivityNet, THUMOS14, ActivityNet-Captions, and Charades-STA demonstrate that Grounding-MD establishes new state-of-the-art performance in zero-shot and supervised settings in open-world moment detection scenarios. All source code and trained models will be released.

Paper Structure

This paper contains 17 sections, 9 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Illustration of UniMD [zeng2024unimd] and Grounding-MD. UniMD operates under a closed-set assumption, limiting its applicability in open-world scenarios. In contrast, Grounding-MD supports user-defined action categories and open-ended natural language event descriptions, enabling it to adapt to diverse and dynamic user queries in open-world environments.
  • Figure 2: Advantages of Grounding-MD. (a) The early and late fusion strategies achieve optimal video-text alignment, enabling the model to gain a deeper understanding of the video-text data. Moreover, the structured prompt design allows the model to handle an arbitrary number of open-ended natural language queries, demonstrating excellent cross-task collaboration; (b) Through pre-training on large-scale temporal action detection and moment retrieval datasets, Grounding-MD develops robust semantic representation learning and video-text cognition abilities, enabling it to handle more complex and diverse query inputs.
  • Figure 3: Overview of the Grounding-MD framework. The Cross-Modality Fusion Encoder performs early fusion of video and text features, establishing initial cross-modal alignment, while the Text-Guided Fusion Decoder conducts late fusion, refining the alignment through deeper interaction between modalities. Additionally, the Query-Wise Pooler addresses the training instability caused by the disparity in text lengths between action categories and event descriptions by generating balanced query-wise textual representations.
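
The Figure 3 caption says the Query-Wise Pooler produces one balanced textual representation per query, so that a one-word action category and a long event description contribute equally. The sketch below illustrates that idea only; it is not the authors' implementation. The whitespace tokenizer, the "." separator, the function names `build_structured_prompt` and `query_wise_mean_pool`, and the choice of mean pooling are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def build_structured_prompt(
    queries: Sequence[str], sep: str = "."
) -> Tuple[List[str], List[int]]:
    """Concatenate queries into one token sequence (hypothetical prompt format).

    Returns the tokens and, for each token, the index of the query that owns it.
    Separator tokens are marked with owner -1.
    """
    tokens: List[str] = []
    owners: List[int] = []
    for query_index, query in enumerate(queries):
        for word in query.split():  # assumed whitespace tokenizer
            tokens.append(word)
            owners.append(query_index)
        tokens.append(sep)
        owners.append(-1)
    return tokens, owners


def query_wise_mean_pool(
    token_feats: Sequence[Sequence[float]],
    owners: Sequence[int],
    num_queries: int,
) -> List[List[float]]:
    """Average token features per query (mean pooling is an assumption).

    Each query yields exactly one vector regardless of its token count.
    """
    dim = len(token_feats[0]) if token_feats else 0
    sums = [[0.0] * dim for _ in range(num_queries)]
    counts = [0] * num_queries
    for feat, owner in zip(token_feats, owners):
        if owner < 0:  # skip separator tokens
            continue
        counts[owner] += 1
        for d, value in enumerate(feat):
            sums[owner][d] += value
    # Guard against empty queries by dividing by at least 1.
    return [[s / max(c, 1) for s in row] for row, c in zip(sums, counts)]


# Usage: a short action category and a longer event description.
queries = ["jump", "a person opens the door"]
tokens, owners = build_structured_prompt(queries)
# Stand-in token features; a real model would use text-encoder outputs.
feats = [[float(i), 1.0] for i in range(len(tokens))]
pooled = query_wise_mean_pool(feats, owners, len(queries))
```

Both queries end up with a single vector of the same dimension, which is the balancing property the caption describes; a learned or attention-based pooler could replace the mean without changing this interface.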