Dual-task Mutual Reinforcing Embedded Joint Video Paragraph Retrieval and Grounding

Mengzhao Wang, Huafeng Li, Yafei Zhang, Jinxing Li, Minghong Xie, Dapeng Tao

TL;DR

This work proposes a Dual-task Mutual Reinforcing Embedded Joint Video Paragraph Retrieval and Grounding method (DMR-JRG), which achieves precise cross-modal matching and grounding by exploring the consistency between local, global, and temporal dimensions of video segments and textual paragraphs.

Abstract

Video Paragraph Grounding (VPG) aims to precisely locate the most appropriate moments within a video that are relevant to a given textual paragraph query. However, existing methods typically rely on large-scale annotated temporal labels and assume that the correspondence between videos and paragraphs is known. This is impractical in real-world applications, as constructing temporal labels requires significant labor costs, and the correspondence is often unknown. To address this issue, we propose a Dual-task Mutual Reinforcing Embedded Joint Video Paragraph Retrieval and Grounding method (DMR-JRG). In this method, retrieval and grounding tasks are mutually reinforced rather than being treated as separate issues. DMR-JRG mainly consists of two branches: a retrieval branch and a grounding branch. The retrieval branch uses inter-video contrastive learning to roughly align the global features of paragraphs and videos, reducing modality differences and constructing a coarse-grained feature space to break free from the need for correspondence between paragraphs and videos. Additionally, this coarse-grained feature space further facilitates the grounding branch in extracting fine-grained contextual representations. In the grounding branch, we achieve precise cross-modal matching and grounding by exploring the consistency between local, global, and temporal dimensions of video segments and textual paragraphs. By synergizing these dimensions, we construct a fine-grained feature space for video and textual features, greatly reducing the need for large-scale annotated temporal labels.
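
To make the retrieval branch's "inter-video contrastive learning" concrete, the snippet below gives a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over a batch of paired global video and paragraph features. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the exact loss, so the InfoNCE form, the temperature value, and the function names `symmetric_contrastive_loss` and `retrieve_video` are all hypothetical.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(video_feats, paragraph_feats, temperature=0.07):
    """Assumed InfoNCE-style loss; row i of each input is a matched pair.

    video_feats, paragraph_feats: arrays of shape (batch, dim) holding one
    global feature per video and per paragraph. Other videos in the batch
    act as negatives for each paragraph, and vice versa.
    """
    # L2-normalise so the dot product is a cosine similarity.
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    p = paragraph_feats / np.linalg.norm(paragraph_feats, axis=1, keepdims=True)
    sim = p @ v.T / temperature  # (batch, batch): paragraph i vs. video j
    idx = np.arange(sim.shape[0])
    # Paragraph-to-video: each row should peak on its own video.
    p2v = -log_softmax(sim, axis=1)[idx, idx].mean()
    # Video-to-paragraph: each column should peak on its own paragraph.
    v2p = -log_softmax(sim, axis=0)[idx, idx].mean()
    return 0.5 * (p2v + v2p)


def retrieve_video(video_feats, paragraph_feat):
    """Return the index of the corpus video most similar to one paragraph."""
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    p = paragraph_feat / np.linalg.norm(paragraph_feat)
    return int(np.argmax(v @ p))


# Toy data: paragraph features are noisy copies of their paired video features.
rng = np.random.default_rng(0)
videos = rng.normal(size=(8, 16))
paragraphs = videos + 0.01 * rng.normal(size=(8, 16))

aligned_loss = symmetric_contrastive_loss(videos, paragraphs)
# Rolling the paragraphs by one position breaks every pairing.
shuffled_loss = symmetric_contrastive_loss(videos, np.roll(paragraphs, 1, axis=0))
```

In this toy setup the correctly paired batch yields a lower loss than the mismatched one. A coarse-grained space trained this way lets a paragraph be matched to a video by nearest-neighbour search, without the video-paragraph correspondence being given in advance.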

Paper Structure

This paper contains 26 sections, 23 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: An illustrative example of Video Sentence Retrieval and Grounding (VSRG) and Video Paragraph Retrieval and Grounding (VPRG). (a) VSRG aims to retrieve the corresponding video from a video corpus along with its specific moments using a single sentence as a query. (b) VPRG aims to retrieve the relevant video from a video corpus through a paragraph query and to locate the exact moments of each sentence in the paragraph within the video.
  • Figure 2: Details of the dual-task mutual reinforcing framework. VPR stands for Video Paragraph Retrieval, and VPG stands for Video Paragraph Grounding.
  • Figure 3: Overview of the proposed method. It comprises three core parts. First, Feature Extraction, which includes video feature extraction and text feature extraction. Second, the Retrieval Branch, which consists of the Video-Paragraph Cross Modal Retrieval (VPCMR) component. Last, the Grounding Branch, which includes three components: Visual-Textual Consistency on local dimension (VTC-LD), Visual-Textual Feature Alignment on global dimension (VTFA-GD), and Bidirectional Temporal Synchronization of Events on temporal dimension (BTSE-TD). Combining the strengths of these three parts mitigates the differences between the visual and textual modalities and ensures that the text accurately corresponds to events in the video, which enables effective video paragraph retrieval and grounding.
  • Figure 4: Details of the candidate moments feature fusion (CMFF). "Index" refers to indexing the corresponding visual features in the fused temporal feature map ${\bar{\bm{F}}}_m$ based on the top $Q$ candidate moments.
  • Figure 5: Diagram of the chronological order of events in a video.
  • ...and 5 more figures