Disentangle and denoise: Tackling context misalignment for video moment retrieval

Kaijing Ma, Han Fang, Xianghao Zang, Chao Ban, Lanxiang Zhou, Zhongjiang He, Yongxiang Li, Hao Sun, Zerun Feng, Xingsong Hou

TL;DR

Video Moment Retrieval suffers from context misalignment due to uneven semantic distribution and noisy backgrounds. The authors propose CDNet, a two-branch framework consisting of Query-guided Semantic Disentangling (QSD) for global and fine-grained alignment, and Context-aware Dynamic Denoising (CDD) for learnable, query-relevant re-sampling of visual context. QSD uses dual-level contrastive losses to disentangle relevant video-text correlations, while CDD leverages learned offsets to focus on semantically critical moments, followed by a cross-modal grounding stage with targeted losses and Hungarian matching. Empirically, CDNet achieves state-of-the-art results on QVHighlights and competitive performance on Charades-STA and TACoS, demonstrating improved fine-grained grounding and robustness to visual noise.
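
The Hungarian matching step mentioned above pairs each ground-truth moment with exactly one predicted moment so that the total matching cost is minimal. The sketch below illustrates that one-to-one assignment on a toy example. It rests on several assumptions that the summary does not confirm: the cost (L1 boundary distance plus 1 - IoU) is a common choice in set-prediction localizers rather than CDNet's stated cost, the names `temporal_iou`, `match_cost`, and `best_assignment` are hypothetical, and the search is brute force over permutations instead of the Hungarian algorithm itself, which reaches the same optimum far more efficiently.

```python
from itertools import permutations


def temporal_iou(a, b):
    """IoU of two (start, end) spans given in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    # The hull length equals the union when the spans overlap;
    # when they do not overlap, inter is 0 and the IoU is 0 anyway.
    hull = max(a[1], b[1]) - min(a[0], b[0])
    return inter / hull if hull > 0 else 0.0


def match_cost(pred, gt):
    # Assumed cost, not taken from the paper: L1 on boundaries plus (1 - IoU).
    l1 = abs(pred[0] - gt[0]) + abs(pred[1] - gt[1])
    return l1 + (1.0 - temporal_iou(pred, gt))


def best_assignment(preds, gts):
    """Return (total_cost, [(pred_idx, gt_idx), ...]) with minimal total cost.

    Brute force over permutations, so only suitable for a handful of spans.
    Requires len(preds) >= len(gts).
    """
    best_cost, best_pairs = float("inf"), []
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, g) for g, p in enumerate(perm)]
    return best_cost, best_pairs


# Toy data: three predicted moments, two ground-truth moments.
preds = [(0.0, 4.0), (10.0, 18.0), (30.0, 35.0)]
gts = [(11.0, 17.0), (1.0, 5.0)]
total_cost, pairs = best_assignment(preds, gts)
print(pairs)
```

Predictions left unmatched (here the third span) would typically be trained as background, though the summary does not say how CDNet treats them.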

Abstract

Video Moment Retrieval, which aims to locate in-context video moments according to a natural language query, is an essential task for cross-modal grounding. Existing methods focus on enhancing the cross-modal interactions between all moments and the textual description for video understanding. However, constantly interacting with all locations is unreasonable because of uneven semantic distribution across the timeline and noisy visual backgrounds. This paper proposes a cross-modal Context Denoising Network (CDNet) for accurate moment retrieval by disentangling complex correlations and denoising irrelevant dynamics. Specifically, we propose a Query-guided Semantic Disentanglement (QSD) module to decouple video moments by estimating alignment levels according to global and fine-grained correlations. A Context-aware Dynamic Denoisement (CDD) module is proposed to enhance understanding of aligned spatial-temporal details by learning a group of query-relevant offsets. Extensive experiments on public benchmarks demonstrate that the proposed CDNet achieves state-of-the-art performance.
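
To make the idea of offset-based re-sampling concrete, here is a minimal sketch: each clip position t reads its feature from position t + offset, with linear interpolation between neighbouring clips. This is an illustration under assumptions, not the authors' CDD module. In the paper the offsets are learned and query-relevant, whereas here they are hand-set constants, and the function name `resample_with_offsets` and the clamping behaviour at the timeline boundaries are hypothetical choices.

```python
import math


def resample_with_offsets(features, offsets):
    """Sample features at positions t + offsets[t] with linear interpolation.

    features: list of T per-clip feature vectors (lists of floats).
    offsets:  list of T real-valued shifts, measured in clips.
    """
    last = len(features) - 1
    out = []
    for t, delta in enumerate(offsets):
        # Clamp the sampling position to the valid timeline (an assumption).
        pos = min(max(t + delta, 0.0), float(last))
        lo = int(math.floor(pos))
        hi = min(lo + 1, last)
        w = pos - lo  # interpolation weight of the right neighbour
        out.append([(1.0 - w) * a + w * b
                    for a, b in zip(features[lo], features[hi])])
    return out


# Four clips with 2-dimensional features; offsets are hand-set for the demo.
features = [[0.0, 0.0], [1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
offsets = [0.5, 0.0, -1.5, 2.0]
resampled = resample_with_offsets(features, offsets)
print(resampled)
```

Because the interpolation is differentiable with respect to the offsets, a network that predicts them could in principle be trained end to end, which is the usual motivation for this style of sampling.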

Paper Structure (12 sections, 14 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: (a) Our proposed disentangling and denoising method. (b) Comparison of results when ambiguity occurs in the video.
  • Figure 2: Illustration of CDNet. (1) In Query-guided Semantic Disentangling (QSD), we use a dual-stream contrastive learning approach to disentangle the visual and textual correlations separately by Global and Fine-grained Guidance. (2) Subsequently, visual and textual features are fed into the Context-aware Dynamic Denoising (CDD) for refined temporal and multi-modal context comprehension. MHA denotes multi-head attention, CMHA signifies corrected multi-head attention, and Offset represents a lightweight convolutional network to guide the model in visual feature re-sampling. (3) The concatenated features are then fed into the Localizer for the ultimate localization.
  • Figure 3: Illustration of offset network.
  • Figure 4:
  • Figure 5:
  • ...and 1 more figure
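
The Figure 2 caption describes QSD as contrastive learning between visual and textual features. A common form of such an objective is the InfoNCE loss, sketched below for matched video-text embedding pairs. Whether CDNet uses exactly this formulation is an assumption; the function names, the temperature of 0.1, and the toy embeddings are all hypothetical. The same function could be applied once to video-level embeddings and once to clip-level embeddings to mimic the global and fine-grained levels, but that pairing is also an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(video_embs, text_embs, temperature=0.1):
    """Mean video-to-text InfoNCE loss; text i is the positive for video i."""
    losses = []
    for i, v in enumerate(video_embs):
        logits = [cosine(v, t) / temperature for t in text_embs]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings: aligned pairs should give a much lower loss than swapped ones.
video = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[1.0, 0.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = info_nce(video, text_aligned)
loss_swapped = info_nce(video, text_swapped)
print(loss_aligned < loss_swapped)
```

Minimizing this loss pulls each video embedding toward its own query and pushes it away from the other queries in the batch, which is the sense in which relevant and irrelevant correlations get separated.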