See, Rank, and Filter: Important Word-Aware Clip Filtering via Scene Understanding for Moment Retrieval and Highlight Detection

YuEun Lee, Jung Uk Kim

TL;DR

The paper tackles moment retrieval (MR) and highlight detection (HD) by making word-level query importance explicit and grounding it in rich scene understanding. It introduces a feature enhancement module that identifies important words and enriches cross-modal representations, and a ranking-based filtering module that iteratively narrows down clips by word relevance. Both are trained under a modal alignment loss that unifies the text, video, and caption modalities. By using a Multimodal Large Language Model (InternVL2) for scene-aware captions and cross-modal cues, the approach achieves state-of-the-art results on QVHighlights, TVSum, and Charades-STA, supported by ablations and analyses of efficiency and caption contribution. The work demonstrates the practical value of word-aware filtering in multimedia understanding, offering a scalable framework for MR/HD that integrates caption knowledge without sacrificing performance.

Abstract

Video moment retrieval (MR) and highlight detection (HD) with natural language queries aim to localize relevant moments and key highlights in a video. However, existing methods overlook the importance of individual words, treating the entire text query and video clips as a black box, which hinders contextual understanding. In this paper, we propose a novel approach that enables fine-grained clip filtering by identifying and prioritizing important words in the query. Our method integrates image-text scene understanding through Multimodal Large Language Models (MLLMs) and enhances the semantic understanding of video clips. We introduce a feature enhancement module (FEM) to capture important words from the query and a ranking-based filtering module (RFM) to iteratively refine video clips based on their relevance to these important words. Extensive experiments demonstrate that our approach significantly outperforms existing state-of-the-art methods, achieving superior performance in both MR and HD tasks. Our code is available at: https://github.com/VisualAIKHU/SRF.
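The rank-then-filter idea described above can be illustrated with a small sketch. This is not the authors' RFM: it assumes that per-word importance scores and clip-to-word relevance scores are already available as plain numbers, and it substitutes hard top-k pruning for whatever filtering the module actually performs on features. The names `rank_and_filter`, `word_importance`, `clip_word_relevance`, `num_rounds`, and `keep_ratio` are hypothetical and chosen only for this example.

```python
from typing import List, Sequence


def rank_and_filter(
    word_importance: Sequence[float],
    clip_word_relevance: Sequence[Sequence[float]],
    num_rounds: int = 2,
    keep_ratio: float = 0.5,
) -> List[int]:
    """Return indices of clips that survive iterative word-ranked filtering.

    Hypothetical illustration only: hard top-k pruning stands in for the
    filtering that the paper's RFM performs, which is not specified here.
    """
    # Rank query words from most to least important.
    ranked_words = sorted(
        range(len(word_importance)),
        key=lambda w: word_importance[w],
        reverse=True,
    )
    surviving = list(range(len(clip_word_relevance)))
    for word in ranked_words[:num_rounds]:
        # Keep at least one clip so the result is never empty.
        keep = max(1, int(len(surviving) * keep_ratio))
        # Order the remaining clips by relevance to the current word.
        surviving.sort(key=lambda c: clip_word_relevance[c][word], reverse=True)
        # Drop the least relevant clips and restore temporal order.
        surviving = sorted(surviving[:keep])
    return surviving


# Toy query "dog catches frisbee" with assumed importance scores.
importance = [0.5, 0.2, 0.3]
# Assumed relevance of six clips to each of the three words.
relevance = [
    [0.9, 0.1, 0.1],
    [0.8, 0.2, 0.7],
    [0.1, 0.1, 0.2],
    [0.7, 0.3, 0.9],
    [0.2, 0.0, 0.8],
    [0.3, 0.1, 0.1],
]
kept = rank_and_filter(importance, relevance)
```

In this toy run the first round filters on "dog" and the second on "frisbee", so each round narrows the candidate set using the next most important word. A learned model would more plausibly apply soft weights to clip features than discard clips outright; the hard cut is used here only to keep the sketch short.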

Paper Structure

This paper contains 25 sections, 17 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Conceptual illustration of our method. We aim to (a) find and prioritize important words in a text query, and (b) filter video clips based on the priority of the words.
  • Figure 1: Example captions for the QVHighlights val set. C1 through C6 are examples of captions corresponding to six video clips in order. Clips marked with red boxes and text are ground-truth (GT) clips, while the others are non-GT clips.
  • Figure 2: Overall architecture. Query, visual, and caption features pass through the feature enhancement module, which prioritizes important words and deepens scene understanding; the ranking-based filtering module then repeatedly filters out information irrelevant to the query.
  • Figure 2: Visualization comparison of MR and HD on the QVHighlights val set. Predictions are compared with the ground truth (GT), TR-DETR [trdetr2024], UVCOM [uvcom2024], and Keyword-DETR [keyworddetr2025]. (a) to (d) are examples of correct predictions, and (e) to (f) are examples of incorrect predictions.
  • Figure 3: The detailed process of the ranking-based filtering module (RFM). Video clips are iteratively filtered based on the ranking of the most important query tokens.
  • ...and 2 more figures