
Beyond Caption-Based Queries for Video Moment Retrieval

David Pujol-Perich, Albert Clapés, Dima Damen, Sergio Escalera, Michael Wray

TL;DR

This work investigates the degradation of existing VMR methods, particularly of DETR architectures, when trained on caption-based queries but evaluated on search queries, and identifies a critical issue in these architectures -- an active decoder-query collapse -- as a primary cause of the poor generalization to multi-moment instances.

Abstract

In this work, we investigate the degradation of existing VMR methods, particularly of DETR architectures, when trained on caption-based queries but evaluated on search queries. To this end, we introduce three benchmarks by modifying the textual queries in three public VMR datasets, namely HD-EPIC, YouCook2 and ActivityNet-Captions. Our analysis reveals two key generalization challenges: (i) a language gap, arising from the linguistic under-specification of search queries, and (ii) a multi-moment gap, caused by the shift from single-moment to multi-moment queries. We also identify a critical issue in these architectures -- an active decoder-query collapse -- as a primary cause of the poor generalization to multi-moment instances. We mitigate this issue with architectural modifications that effectively increase the number of active decoder queries. Extensive experiments demonstrate that our approach improves performance on search queries by up to 14.82% mAP_m, and by up to 21.83% mAP_m on multi-moment search queries. The code, models and data are available on the project webpage: https://davidpujol.github.io/beyond-vmr/
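To make the multi-moment gap concrete, the following is a minimal sketch of how grouping captions by a shared under-specified query can turn several single-moment instances into one multi-moment instance. It is an illustration only, not the authors' pipeline: the `CaptionInstance` type, the `group_into_search_queries` helper, the example captions and timestamps, and the choice to group per video are all assumptions made here. In the paper, the under-specifications and the representative search query are produced by agents, which this sketch replaces with hand-written strings.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class CaptionInstance(NamedTuple):
    """Hypothetical record: one detailed caption tied to a single GT moment."""
    video_id: str
    caption: str
    under_specified: str          # assumed to come from an under-specification step
    moment: Tuple[float, float]   # (start, end) in seconds


def group_into_search_queries(
    instances: List[CaptionInstance],
) -> Dict[Tuple[str, str], List[Tuple[float, float]]]:
    """Group captions that share an under-specified query within a video.

    Each group keeps every GT moment of its members, so a group with more
    than one member becomes a multi-moment instance.
    """
    groups = defaultdict(list)
    for inst in instances:
        groups[(inst.video_id, inst.under_specified)].append(inst.moment)
    return {key: sorted(moments) for key, moments in groups.items()}


# Made-up example data, for illustration only.
examples = [
    CaptionInstance("v1", "pick up the red mug from the top shelf", "pick up mug", (3.0, 5.5)),
    CaptionInstance("v1", "pick up the blue mug next to the sink", "pick up mug", (41.0, 44.0)),
    CaptionInstance("v1", "slice the onion on the wooden board", "slice onion", (12.0, 20.0)),
]

search_queries = group_into_search_queries(examples)
multi_moment = {k: v for k, v in search_queries.items() if len(v) > 1}
```

In this toy data, the two detailed mug captions each map to one moment, while the shared under-specified query maps to both moments. A model trained only on the single-moment captions never sees that case.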

Paper Structure (44 sections, 22 equations, 23 figures, 30 tables)

This paper contains 44 sections, 22 equations, 23 figures, 30 tables.

Figures (23)

  • Figure 1: After watching a video, annotators write detailed, visually-informed captions that map to a single GT moment. However, at inference time, users formulate less detailed, visually-uninformed search queries that often map to multiple GT moments.
  • Figure 2: Overview of the search-query pipeline. Each caption is first processed by an agent that generates per-query under-specifications, which are validated by a second identical agent and manually re-annotated if abnormal. Individual queries mapping to the same under-specified query are then grouped, and a final agent produces a representative search query per group.
  • Figure 3: Evaluation of the representative models on both the original datasets and their corresponding search query extensions.
  • Figure 4: Performance degradation for CG-DETR on caption versus search-based evaluation for the "single" and "multi" splits.
  • Figure 5: Visualization of the active query collapse on HD-EPIC-S2 for the base CG-DETR and our method -SA+QD.
  • ...and 18 more figures