MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval

Huaying Yuan, Jian Ni, Zheng Liu, Yueze Wang, Junjie Zhou, Zhengyang Liang, Bo Zhao, Zhao Cao, Zhicheng Dou, Ji-Rong Wen

TL;DR

MomentSeeker is a dedicated benchmark for long-video moment retrieval, built to address the gap between long video understanding (LVU) and precise temporal grounding. It combines a diverse collection of long videos (averaging over 1200 seconds) with a three-level task taxonomy and multimodal queries, enabling evaluation of both retrieval-based and generation-based approaches on grounding accuracy and efficiency. The experimental results reveal persistent challenges in fine-grained temporal localization, especially under multimodal queries and long contexts, while also showing that longer temporal context and model scaling can improve performance. The benchmark is publicly released to support progress in temporal grounding and scalable LVU systems.

Abstract

Accurately locating key moments within long videos is crucial for solving long video understanding (LVU) tasks. However, existing benchmarks are either severely limited in video length and task diversity, or they focus solely on end-to-end LVU performance, making them unsuitable for evaluating whether key moments can be accurately accessed. To address this challenge, we propose MomentSeeker, a novel benchmark for long-video moment retrieval (LMVR), distinguished by the following features. First, it is built on long and diverse videos, averaging over 1200 seconds in duration and collected from various domains, e.g., movie, anomaly, egocentric, and sports. Second, it covers a variety of real-world scenarios at three levels (global-level, event-level, and object-level), spanning common tasks such as action recognition, object localization, and causal reasoning. Third, it incorporates rich forms of queries, including text-only queries, image-conditioned queries, and video-conditioned queries. On top of MomentSeeker, we conduct comprehensive experiments for both generation-based approaches (directly using MLLMs) and retrieval-based approaches (leveraging video retrievers). Our results reveal significant challenges in long-video moment retrieval in terms of accuracy and efficiency, despite improvements from the latest long-video MLLMs and task-specific fine-tuning. We have publicly released MomentSeeker (https://yhy-2000.github.io/MomentSeeker/) to facilitate future research in this area.

Paper Structure

This paper contains 28 sections, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Demonstrative examples of the MomentSeeker benchmark. Dashed boxes denote the sources of image $q_I$ and video $q_V$ in the multi-modal queries, while solid boxes indicate the ground truth moment(s). Red circles mark key queried information.
  • Figure 2: Dataset statistics. (a) Question type distribution, (b) video duration distribution across samples, and (c) answering time range length distribution across samples. MomentSeeker has a full spectrum of video length and covers different core abilities of the moment retrieval task.
  • Figure 3: Examples of each task. Dashed boxes show sources of query image $q_I$ and video $q_V$; solid boxes mark ground truth moments. Red circles highlight key queried information.
  • Figure 4: Sub-task performance of different retrieval-based methods and generation-based methods.
  • Figure 5: Evaluation results w.r.t. query modalities: TMR (text-only), IMR (image-conditioned), and VMR (video-conditioned) moment retrieval.
  • ...and 5 more figures