Video sentence grounding with temporally global textual knowledge
Cai Chen, Runzhong Zhang, Jianjun Gao, Kejun Wu, Kim-Hui Yap, Yi Wang
TL;DR
This work tackles temporal sentence grounding by addressing the cross-modal domain gap between video and temporally localized language queries. It introduces a Pseudo-query Intermediary Network (PIN) that aligns visual features with temporally global pseudo-query features, and a learnable PQ-prompt that propagates this global textual knowledge into the textual encoder and the multi-modal fusion module for stronger visual–language alignment. The approach yields state-of-the-art results on Charades-STA and ActivityNet-Captions, and ablations confirm that both PIN and the PQ-prompt improve boundary prediction, especially at higher IoU thresholds. By combining contrastive learning with prompt-guided fusion, the method shows how temporally global textual knowledge can be used for more precise temporal grounding.
Abstract
Temporal sentence grounding retrieves the video moment described by a natural language query. Many existing works feed the given video and the temporally localized query directly into a grounding model, overlooking the inherent domain gap between the two modalities. In this paper, we use pseudo-query features, which carry extensive temporally global textual knowledge sourced from the same video-query pair, to bridge this domain gap and increase the similarity between multi-modal features. Specifically, we propose a Pseudo-query Intermediary Network (PIN) that uses contrastive learning to better align visual features and comprehensive pseudo-query features in the feature space. We then use learnable prompts to encapsulate the knowledge of the pseudo-queries and propagate them into the textual encoder and the multi-modal fusion module, further improving the alignment between visual and language features for better temporal grounding. Extensive experiments on the Charades-STA and ActivityNet-Captions datasets demonstrate the effectiveness of our method.
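To make the contrastive alignment step more concrete, the following is a minimal sketch of a symmetric InfoNCE-style loss between paired video and pseudo-query feature vectors. The abstract does not state the exact objective used in PIN, so the loss form, the function name `info_nce`, and the `temperature` value are illustrative assumptions rather than the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(video_feats, pseudo_query_feats, temperature=0.07):
    """Symmetric InfoNCE loss (assumed form, not taken from the paper).

    Row i of video_feats and row i of pseudo_query_feats are treated as a
    matched pair; every other row in the batch acts as a negative.
    """
    v = [l2_normalize(x) for x in video_feats]
    q = [l2_normalize(x) for x in pseudo_query_feats]
    n = len(v)
    # Cosine similarity between every video and every pseudo-query, scaled.
    logits = [[dot(v[i], q[j]) / temperature for j in range(n)] for i in range(n)]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_sum = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    # Average the video-to-text and text-to-video directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


# Toy batch: matched pairs point in the same direction.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Same features, but the pairing is swapped, so each pair is mismatched.
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch, `aligned` is smaller than `shuffled`, which is the behavior a contrastive objective relies on: minimizing the loss pulls each video feature toward its own pseudo-query feature and away from those of other videos.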
