Watch Video, Catch Keyword: Context-aware Keyword Attention for Moment Retrieval and Highlight Detection

Sung Jin Um; Dongjin Kim; Sangmin Lee; Jung Uk Kim

TL;DR

This work tackles joint video moment retrieval (MR) and highlight detection (HD) from natural language queries by introducing a Video Context-aware Keyword Attention (VCKA) framework. VCKA combines a video context clustering module, which forms concise context representations $\mathbf{F}^{cv}$, with a keyword weight detection module that yields context-adjusted keyword features $\mathbf{F}^{wt}$; a keyword-aware contrastive objective then aligns the visual and textual modalities. The training objective adds this contrastive term to the moment retrieval and highlight detection losses, $\mathcal{L}_{Total} = \mathcal{L}_{mr} + \mathcal{L}_{hd} + \lambda_{kw} \mathcal{L}_{kw}$, where the weight $\lambda_{kw}$ is tuned (0.3). Empirically, the method improves MR/HD performance over strong baselines on QVHighlights, TVSum, and Charades-STA, indicating that it handles keyword variation and video-wide context well enough for practical video search and curation tasks. The implementation is publicly available, and the results highlight the importance of context-driven keyword weighting for video-language tasks.
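The combined objective above can be illustrated with a minimal sketch. It assumes the three loss terms have already been computed as scalars; the function name `total_loss` and its argument names are hypothetical and are not taken from the released code. Only the weighting of the keyword-aware term by $\lambda_{kw} = 0.3$ comes from the summary.

```python
def total_loss(loss_mr: float, loss_hd: float, loss_kw: float,
               lambda_kw: float = 0.3) -> float:
    """Combine the three terms as L_Total = L_mr + L_hd + lambda_kw * L_kw.

    loss_mr, loss_hd, loss_kw are assumed to be precomputed scalar losses
    (hypothetical inputs for illustration). lambda_kw defaults to the
    value 0.3 reported in the summary.
    """
    return loss_mr + loss_hd + lambda_kw * loss_kw


# Illustrative values only; they are not results from the paper.
example = total_loss(loss_mr=1.0, loss_hd=2.0, loss_kw=1.0)
```

In a real training loop the same expression would be applied to framework tensors rather than Python floats, so that gradients flow through all three terms.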

Abstract

The goal of video moment retrieval and highlight detection is to identify specific segments and highlights based on a given text query. With the rapid growth of video content and the overlap between these tasks, recent works have addressed both simultaneously. However, they still struggle to fully capture the overall video context, making it challenging to determine which words are most relevant. In this paper, we present a novel Video Context-aware Keyword Attention module that overcomes this limitation by capturing keyword variation within the context of the entire video. To achieve this, we introduce a video context clustering module that provides concise representations of the overall video context, thereby enhancing the understanding of keyword dynamics. Furthermore, we propose a keyword weight detection module with keyword-aware contrastive learning that incorporates keyword information to enhance fine-grained alignment between visual and textual features. Extensive experiments on the QVHighlights, TVSum, and Charades-STA benchmarks demonstrate that our proposed method significantly improves performance in moment retrieval and highlight detection tasks compared to existing approaches. Our code is available at: https://github.com/VisualAIKHU/Keyword-DETR

Paper Structure (22 sections, 6 equations, 5 figures, 7 tables)

Introduction
Related Works
Moment Retrieval
Highlight Detection
Proposed Method
Video Context-aware Keyword Attention Module
Video Contextual MR/HD Prediction
Keyword-aware Contrastive Loss
Training Objective
Experiments
Datasets and Evaluation Metrics
Implementation Details
Comparison to Prior Works
Ablation Study
Visualization Results
...and 7 more sections

Figures (5)

Figure 1: Text keywords can vary by video context. The less frequently a word appears in the video clip, the more important it becomes within the text query. In Video #1, 'dog' is important, while in Video #2, 'garden' is important.
Figure 1: Additional visualization comparisons of our method with TR-DETR and UVCOM for moment retrieval (MR) and highlight detection (HD) on the QVHighlights val set.
Figure 2: Overall configuration of our moment retrieval and highlight detection. $\otimes$ indicates element-wise multiplication.
Figure 3: Visualization examples for moment retrieval (MR) and highlight detection (HD) on the QVHighlights val set.
Figure 4: Visualization results of keyword weight effectiveness on QVHighlights val set.
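The intuition in the first Figure 1 caption (a query word that appears in fewer clips of the video matters more) can be shown with a toy sketch. This is not the paper's keyword weight detection module, which is learned. It is a hand-written illustration that assumes a word-to-clip similarity matrix with values in [0, 1] is already given. The function name `keyword_weights` and the `threshold` and `temperature` parameters are hypothetical.

```python
import math


def keyword_weights(word_clip_sim, threshold=0.5, temperature=1.0):
    """Toy weighting: words present in fewer clips receive larger weights.

    word_clip_sim[i][j] is an assumed similarity in [0, 1] between query
    word i and video clip j. Each row must be non-empty.
    """
    # Fraction of clips in which each word counts as "present".
    freqs = [sum(1 for s in row if s >= threshold) / len(row)
             for row in word_clip_sim]
    # Rarer words get larger logits; softmax turns them into weights.
    logits = [-f / temperature for f in freqs]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


# Query words: ["dog", "garden"]. In this made-up video the garden is
# visible in every clip while the dog is visible in only one.
sims = [
    [0.9, 0.1, 0.1, 0.1],  # "dog"
    [0.9, 0.9, 0.9, 0.9],  # "garden"
]
weights = keyword_weights(sims)
```

With these made-up similarities, "dog" receives the larger weight, matching the Video #1 case in the caption; swapping the two rows would favor "garden", as in Video #2.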
