Saliency-Guided DETR for Moment Retrieval and Highlight Detection

Aleksandr Gordeev, Vladimir Dokholyan, Irina Tolstykh, Maksim Kuprashevich

TL;DR

This work targets robust video moment retrieval and highlight detection by improving cross-modal alignment with a Saliency-Guided DETR (SG-DETR) framework. It introduces Saliency-Guided Cross Attention and a local-to-global saliency refinement, plus a hybrid ATSS-DETR detector and an IoU-based localization head to improve span accuracy. To address data scarcity, the authors create InterVid-MR, a 150k-sample pretraining dataset, and report state-of-the-art zero-shot and fine-tuned results on QVHighlights, Charades-STA, and TACoS. The approach offers a scalable, end-to-end solution for video-language grounding, with practical relevance for production systems and multimodal search.

Abstract

Existing approaches for video moment retrieval and highlight detection are not able to align text and video features efficiently, resulting in unsatisfactory performance and limited production usage. To address this, we propose a novel architecture that utilizes recent foundational video models designed for such alignment. Combined with the introduced Saliency-Guided Cross Attention mechanism and a hybrid DETR architecture, our approach significantly enhances performance in both moment retrieval and highlight detection tasks. To improve results further, we developed InterVid-MR, a large-scale and high-quality dataset for pretraining. Using it, our architecture achieves state-of-the-art results on the QVHighlights, Charades-STA and TACoS benchmarks. The proposed approach provides an efficient and scalable solution for both zero-shot and fine-tuning scenarios in video-language tasks.

Paper Structure

This paper contains 33 sections, 18 equations, 2 figures, and 6 tables.

Figures (2)

  • Figure 1: Overall architecture of our framework. The notation is explained in detail in \ref{section:mr}.
  • Figure 2: The impact of pretraining dataset size on the MR mAP@avg metric on the QVHighlights validation set.