Video sentence grounding with temporally global textual knowledge

Cai Chen, Runzhong Zhang, Jianjun Gao, Kejun Wu, Kim-Hui Yap, Yi Wang

TL;DR

This work tackles temporal sentence grounding by addressing the cross-modal domain gap between video and temporally localized language. It introduces a Pseudo-query Intermediary Network (PIN) that learns from temporally global pseudo-queries and a learnable PQ-prompt to propagate global textual knowledge into the textual encoder and fusion module, enabling stronger visual–language alignment. The approach yields state-of-the-art results on Charades-STA and ActivityNet-Captions, with ablations confirming the effectiveness of PIN and PQ-prompt in improving boundary prediction, especially at higher IoU thresholds. By integrating contrastive learning with prompt-guided fusion, the method demonstrates a scalable way to leverage global textual knowledge for more precise temporal grounding, with clear implications for cross-modal video understanding tasks.
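The "higher IoU thresholds" mentioned above refer to the temporal intersection-over-union between a predicted moment and the ground-truth moment. The following is a minimal sketch of that metric and of a recall-at-threshold score, not the authors' evaluation code. The function names `temporal_iou` and `recall_at_iou` and the `(start, end)` tuple format in seconds are assumptions made for illustration.

```python
def temporal_iou(pred, gt):
    """IoU between two temporal segments given as (start, end) in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the segments are disjoint.
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds, gts, threshold):
    """Fraction of top-1 predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


if __name__ == "__main__":
    preds = [(0.0, 10.0), (2.0, 8.0)]
    gts = [(5.0, 15.0), (2.0, 9.0)]
    for t in (0.3, 0.5, 0.7):
        print(t, recall_at_iou(preds, gts, t))
```

A stricter threshold such as 0.7 only counts predictions whose boundaries closely match the annotation, which is why gains at higher thresholds indicate more precise boundary prediction.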

Abstract

Temporal sentence grounding retrieves the video moment described by a natural language query. Many existing works directly combine the given video with the temporally localized query for temporal grounding, overlooking the inherent domain gap between the two modalities. In this paper, we use pseudo-query features, which contain extensive temporally global textual knowledge sourced from the same video-query pair, to help bridge this domain gap and increase the similarity between multi-modal features. Specifically, we propose a Pseudo-query Intermediary Network (PIN) that improves the alignment of visual and comprehensive pseudo-query features in the feature space through contrastive learning. We then use learnable prompts to encapsulate the knowledge of the pseudo-queries and propagate them into the textual encoder and the multi-modal fusion module, further enhancing the alignment between visual and language features for better temporal grounding. Extensive experiments on the Charades-STA and ActivityNet-Captions datasets demonstrate the effectiveness of our method.
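The abstract states that PIN aligns visual and pseudo-query features through contrastive learning but does not give the loss here. As an illustration only, the sketch below uses a generic InfoNCE-style objective, in which the visual feature of video i is the positive for the pseudo-query feature of the same video and the other items in the batch are negatives. The function name `info_nce`, the temperature value, and the use of one pooled feature vector per video are assumptions, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce(visual, pseudo, temperature=0.07):
    """Generic InfoNCE loss: visual[i] should match pseudo[i] and no other pseudo[j]."""
    v = [l2_normalize(x) for x in visual]
    q = [l2_normalize(x) for x in pseudo]
    n = len(v)
    total = 0.0
    for i in range(n):
        # Cosine similarities between visual item i and every pseudo-query feature.
        logits = [sum(a * b for a, b in zip(v[i], q[j])) / temperature for j in range(n)]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / n


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    shuffled = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned loss:", info_nce(visual, aligned))
    print("shuffled loss:", info_nce(visual, shuffled))
```

The loss is small when each visual feature is closest to its own pseudo-query feature and large when the pairing is wrong, which is the sense in which such an objective pulls the two modalities together in the feature space.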

Paper Structure (14 sections, 14 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: (a) Most existing methods (2DTAN, HCL, VSLNet, TCFN) directly integrate the given video and temporally localized query features, overlooking the inherent domain gaps between different modalities. (b) We introduce PIN to further bridge domain gaps, incorporating additional comprehensive temporally global textual knowledge that enhances the overall similarity between multi-modal features. In addition, we use the learnable prompt to enhance the alignment learning of the multi-modal fusion module.
  • Figure 2: (a) An overview of our proposed method. (b) The generation of comprehensive pseudo-queries from the untrimmed video and language query. (c) The Pseudo-query Intermediary Network (PIN), which bridges the multi-modal domain gap by contrastively learning the visual and pseudo-query features to improve the similarity of visual and language features. Furthermore, the learnable PQ-prompt $\mathcal{P}$ output by the self-attention layer, which encompasses global textual knowledge, is integrated into the textual encoder to enhance feature alignment in the multi-modal fusion module (shown in Figure 3). At the inference stage, pseudo-queries are replaced by the ground truth query to retrieve the relevant PQ-prompt from the prompt pool, enhancing the target moment prediction.
  • Figure 3: Illustration of the Prompt Guided Multi-modal Fusion module.
  • Figure 4: Qualitative comparison with the baseline EMB on the Charades-STA (top) and ActivityNet (bottom) test sets.