Multi-modal Fusion and Query Refinement Network for Video Moment Retrieval and Highlight Detection

Yifang Xu, Yunzhuo Sun, Benxiang Zhai, Zien Xie, Youyao Jia, Sidan Du

TL;DR

The paper tackles video moment retrieval and highlight detection (MR&HD) by exploiting multi-modal visual signals (RGB, optical flow, and depth) alongside hierarchical language understanding. It introduces MRNet, a modular framework with a multi-modal fusion module, a query refinement module, a cross-attention transformer, and a decoder-free encoder, trained with a joint loss that pairs saliency prediction and span grounding via Hungarian matching. The approach reports state-of-the-art results on QVHighlights and Charades-STA, including gains of +3.41 in MR-mAP@Avg and +3.46 in HD-HIT@1 on QVHighlights, by learning complementary visual cues and hierarchical linguistic semantics. The results indicate that integrating diverse visual modalities with layered textual representations can substantially improve video grounding performance and efficiency on MR&HD tasks.

Abstract

Given a video and a linguistic query, video moment retrieval and highlight detection (MR&HD) aim to locate all the relevant spans while simultaneously predicting saliency scores. Most existing methods use only RGB images as input, overlooking inherent multi-modal visual signals such as optical flow and depth. In this paper, we propose a Multi-modal Fusion and Query Refinement Network (MRNet) to learn complementary information from multi-modal cues. Specifically, we design a multi-modal fusion module to dynamically combine RGB, optical flow, and depth maps. Furthermore, to simulate human understanding of sentences, we introduce a query refinement module that merges text at different granularities: the word, phrase, and sentence levels. Comprehensive experiments on the QVHighlights and Charades-STA datasets indicate that MRNet outperforms current state-of-the-art methods, achieving notable improvements in MR-mAP@Avg (+3.41) and HD-HIT@1 (+3.46) on QVHighlights.
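
The abstract does not say how the fusion module weights the three modalities. As a rough illustration only, the sketch below assumes one simple form of "dynamic" combination: a softmax over three per-clip gate scores, followed by a weighted sum of the RGB, optical flow, and depth feature vectors. The names `fuse_clip` and `gate_scores` are hypothetical, and the gate scores are passed in by hand here, whereas a real model would presumably predict them. This is not the paper's MFM.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_clip(rgb, flow, depth, gate_scores):
    """Fuse three same-length feature vectors for a single clip.

    gate_scores holds three unnormalized scores (RGB, flow, depth).
    This gating scheme is an assumption made for illustration,
    not the fusion rule described in the paper.
    """
    if not (len(rgb) == len(flow) == len(depth)):
        raise ValueError("all modalities must share one feature dimension")
    if len(gate_scores) != 3:
        raise ValueError("expected one gate score per modality")
    w_rgb, w_flow, w_depth = softmax(gate_scores)
    return [
        w_rgb * r + w_flow * f + w_depth * d
        for r, f, d in zip(rgb, flow, depth)
    ]


# Equal gate scores reduce to a plain average of the three modalities.
fused_equal = fuse_clip([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0, 0.0])

# A dominant flow score makes the fused vector follow the flow features.
fused_flow = fuse_clip([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 50.0, 0.0])
```

Because the weights are recomputed for each clip, a clip with strong motion could lean on optical flow while a static clip could lean on depth, which is the intuition the abstract gives for combining the modalities.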

Paper Structure (14 sections, 5 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: (a) A depiction of MR&HD. (b) Depth information helps the model understand static scenes. (c) Optical flow helps the model reason about dynamic scenes.
  • Figure 2: Overview of Multi-modal Fusion and Query Refinement Network (MRNet).
  • Figure 3: The multi-modal fusion module (MFM) aggregates RGB, optical flow, and depth features to enhance dynamic scene reasoning and improve static scene understanding.
  • Figure 4: The query refinement module (QRM) integrates textual features at different levels.
  • Figure 5: Qualitative comparison of results on the QVHighlights val split.