Watch Video, Catch Keyword: Context-aware Keyword Attention for Moment Retrieval and Highlight Detection
Sung Jin Um, Dongjin Kim, Sangmin Lee, Jung Uk Kim
TL;DR
This work tackles joint video moment retrieval (MR) and highlight detection (HD) under natural language queries by introducing a Video Context-aware Keyword Attention (VCKA) framework. VCKA combines a video context clustering module, which forms concise context representations $\mathbf{F}^{cv}$, with a keyword weight detection module that yields context-adjusted keyword features $\mathbf{F}^{wt}$; a keyword-aware contrastive objective then aligns the visual and textual modalities. The training objective adds this keyword-aware contrastive term to the moment retrieval and highlight detection losses, $\mathcal{L}_{Total} = \mathcal{L}_{mr} + \mathcal{L}_{hd} + \lambda_{kw} \mathcal{L}_{kw}$, where the weight $\lambda_{kw}$ is tuned to 0.3. Empirically, the method improves MR/HD performance on QVHighlights, TVSum, and Charades-STA over strong baselines, indicating that it handles keyword variation and video-wide context well enough for practical video search and curation tasks. An implementation is publicly available, and the results highlight the importance of context-driven keyword weighting for video-language tasks.
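As a minimal sketch of how the three terms in $\mathcal{L}_{Total}$ combine, the snippet below treats each loss as an already-computed scalar. The function name `total_loss` and its argument names are hypothetical and not taken from the released code; only the additive form and the value $\lambda_{kw} = 0.3$ come from the summary above.

```python
def total_loss(loss_mr: float, loss_hd: float, loss_kw: float,
               lambda_kw: float = 0.3) -> float:
    """Combine the three training losses as L_mr + L_hd + lambda_kw * L_kw.

    loss_mr:   moment retrieval loss (assumed to be computed elsewhere)
    loss_hd:   highlight detection loss (assumed to be computed elsewhere)
    loss_kw:   keyword-aware contrastive loss (assumed to be computed elsewhere)
    lambda_kw: weight on the contrastive term; 0.3 is the value quoted above
    """
    return loss_mr + loss_hd + lambda_kw * loss_kw


if __name__ == "__main__":
    # Illustrative loss values only; they are not results from the paper.
    print(total_loss(loss_mr=1.0, loss_hd=2.0, loss_kw=0.5))
```

Setting `lambda_kw=0.0` removes the contrastive term and leaves only the MR and HD losses, which is one simple way to picture what the keyword-aware term adds to the objective.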
Abstract
The goal of video moment retrieval and highlight detection is to identify specific segments and highlights based on a given text query. With the rapid growth of video content and the overlap between these tasks, recent works have addressed both simultaneously. However, they still struggle to fully capture the overall video context, making it challenging to determine which words are most relevant. In this paper, we present a novel Video Context-aware Keyword Attention module that overcomes this limitation by capturing keyword variation within the context of the entire video. To achieve this, we introduce a video context clustering module that provides concise representations of the overall video context, thereby enhancing the understanding of keyword dynamics. Furthermore, we propose a keyword weight detection module with keyword-aware contrastive learning that incorporates keyword information to enhance fine-grained alignment between visual and textual features. Extensive experiments on the QVHighlights, TVSum, and Charades-STA benchmarks demonstrate that our proposed method significantly improves performance in moment retrieval and highlight detection tasks compared to existing approaches. Our code is available at: https://github.com/VisualAIKHU/Keyword-DETR
