Knowledge-Refined Dual Context-Aware Network for Partially Relevant Video Retrieval

Junkai Yang, Qirui Wang, Yaoqing Jin, Shuai Ma, Minghan Xu, Shanmin Pang

Abstract

Retrieving partially relevant segments from untrimmed videos remains difficult due to two persistent challenges: the mismatch in information density between text and video segments, and limited attention mechanisms that overlook semantic focus and event correlations. We present KDC-Net, a Knowledge-Refined Dual Context-Aware Network that tackles these issues from both textual and visual perspectives. On the text side, a Hierarchical Semantic Aggregation module captures and adaptively fuses multi-scale phrase cues to enrich query semantics. On the video side, a Dynamic Temporal Attention mechanism employs relative positional encoding and adaptive temporal windows to highlight key events with local temporal coherence. Additionally, a dynamic CLIP-based distillation strategy, enhanced with temporal-continuity-aware refinement, ensures segment-aware and objective-aligned knowledge transfer. Experiments on PRVR benchmarks show that KDC-Net consistently outperforms state-of-the-art methods, especially under low moment-to-video ratios.
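To make the idea of attention restricted to temporal windows with a relative positional term more concrete, the following is a minimal sketch under stated assumptions, not the paper's Dynamic Temporal Attention. The function name `windowed_attention`, the per-position half-widths in `windows`, and the additive offset bias in `rel_bias` are all hypothetical; the abstract does not say how KDC-Net chooses its windows or parameterizes its positional encoding.

```python
import math


def windowed_attention(scores, windows, rel_bias):
    """Softmax attention weights restricted to a local temporal window.

    scores:   T x T list of raw query-key similarity scores.
    windows:  list of T half-widths; position i attends only to positions j
              with |i - j| <= windows[i]. Supplying them per position is an
              assumption standing in for the "adaptive" window selection.
    rel_bias: dict mapping a relative offset (j - i) to an additive bias,
              a simple stand-in for relative positional encoding.
    """
    T = len(scores)
    weights = []
    for i in range(T):
        lo = max(0, i - windows[i])
        hi = min(T, i + windows[i] + 1)
        logits = [scores[i][j] + rel_bias.get(j - i, 0.0) for j in range(lo, hi)]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        row = [0.0] * T  # positions outside the window get zero weight
        for k, j in enumerate(range(lo, hi)):
            row[j] = exps[k] / z
        weights.append(row)
    return weights


# Toy input: 5 clips with uniform similarity; the middle clip gets a wider window.
T = 5
scores = [[0.0] * T for _ in range(T)]
windows = [1, 1, 2, 1, 1]
rel_bias = {d: -0.5 * abs(d) for d in range(-2, 3)}  # penalize distant clips
attn = windowed_attention(scores, windows, rel_bias)
```

With uniform scores, each row of `attn` concentrates on the clip itself and decays with temporal distance, while clips outside the window receive exactly zero weight. A learned model would replace the hand-set `windows` and `rel_bias` with predicted or trainable quantities.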

Paper Structure (21 sections, 11 equations, 3 figures, 3 tables)


Figures (3)

  • Figure 1: Overview of the PRVR task objective, together with an illustration of how the proposed Knowledge-Refined Distillation strategy refines the distillation signals obtained from the teacher model.
  • Figure 2: Illustration of KDC-Net. It employs a distillation framework in which the student model comprises two independent branches with no parameter sharing.
  • Figure 3: Ablation studies on (a) the KRD window size; (b) the DTA parameters; (c) the $\delta$ and $\lambda$ parameters.