Towards Efficient Partially Relevant Video Retrieval with Active Moment Discovering
Peipei Song, Long Zhang, Long Lan, Weidong Chen, Dan Guo, Xun Yang, Meng Wang
TL;DR
This work addresses partially relevant video retrieval (PRVR) for untrimmed videos by shifting from dense clip modeling to active moment discovery. The proposed AMDNet learns span anchors representing semantic moments and uses masked multi-moment attention to produce compact moment-enhanced video representations, enabling robust text-to-video matching via a max-over-moments similarity. It introduces three losses (partially relevant retrieval, moment diversity, and moment relevance) to jointly optimize retrieval and moment discovery in an end-to-end framework. Empirical results on TVR and ActivityNet Captions show AMDNet achieving state-of-the-art performance with substantially improved efficiency, including a lightweight parameter footprint and faster retrieval, highlighting its practical applicability for large-scale video search.
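The max-over-moments matching mentioned above can be illustrated with a minimal sketch. It assumes that the query and each discovered moment are already embedded as plain vectors and that cosine similarity is the matching function; the function names and the toy data are hypothetical and are not taken from the AMDNet implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def max_over_moments_similarity(query_vec, moment_vecs):
    """Score a video by its best-matching moment (assumed scoring rule)."""
    return max(cosine(query_vec, m) for m in moment_vecs)


def rank_videos(query_vec, videos):
    """Return video ids sorted by descending max-over-moments score."""
    scores = {
        vid: max_over_moments_similarity(query_vec, moments)
        for vid, moments in videos.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (hypothetical): each video holds two moment embeddings.
query = [1.0, 0.0]
videos = {
    "video_a": [[0.0, 1.0], [1.0, 0.0]],  # second moment matches the query exactly
    "video_b": [[0.0, 1.0], [1.0, 1.0]],  # best moment is only partially aligned
}
ranking = rank_videos(query, videos)
```

Because only the best moment contributes to the score, a video is retrieved when any one of its moments matches the query, which is the partial-relevance behavior the task calls for.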
Abstract
Partially relevant video retrieval (PRVR) is a practical yet challenging task in text-to-video retrieval, where videos are untrimmed and contain much background content. The goal is a solution that is both effective and efficient at capturing the partial correspondence between text queries and untrimmed videos. Existing PRVR methods typically focus on modeling multi-scale clip representations, but they suffer from content independence and information redundancy, which impairs retrieval performance. To overcome these limitations, we propose a simple yet effective approach with active moment discovering (AMDNet), which aims to discover video moments that are semantically consistent with their queries. By using learnable span anchors to capture distinct moments and applying masked multi-moment attention to emphasize salient moments while suppressing redundant backgrounds, we obtain more compact and informative video representations. To further enhance moment modeling, we introduce a moment diversity loss that encourages different moments to cover distinct regions and a moment relevance loss that promotes semantically query-relevant moments; both cooperate with a partially relevant retrieval loss for end-to-end optimization. Extensive experiments on two large-scale video datasets (i.e., TVR and ActivityNet Captions) demonstrate the superiority and efficiency of AMDNet. In particular, AMDNet is about 15.5 times smaller (#parameters) while scoring 6.0 points higher (SumR) than the recent method GMMFormer on TVR.
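To make the idea of span anchors and masking concrete, the following is a simplified sketch rather than the AMDNet architecture. It assumes that a span anchor is a (center, width) pair in normalized video time, that the mask is hard (binary), and it swaps in masked mean pooling in place of the paper's masked multi-moment attention. All names and values are hypothetical.

```python
def span_to_mask(center, width, num_frames):
    """Turn a normalized (center, width) span into a binary mask over frames.

    Assumption: frame t is placed at normalized time (t + 0.5) / num_frames
    and is kept when that time falls inside the span.
    """
    start = max(0.0, center - width / 2.0)
    end = min(1.0, center + width / 2.0)
    mask = []
    for t in range(num_frames):
        pos = (t + 0.5) / num_frames
        mask.append(1 if start <= pos <= end else 0)
    return mask


def masked_mean(frames, mask):
    """Average only the frame features selected by the mask.

    This is a stand-in for attention: frames outside the span are ignored.
    Returns a zero vector if the mask selects no frame.
    """
    dim = len(frames[0])
    kept = [f for f, m in zip(frames, mask) if m]
    if not kept:
        return [0.0] * dim
    return [sum(f[d] for f in kept) / len(kept) for d in range(dim)]


# Toy video (hypothetical): four frames with 2-dimensional features.
frames = [[1.0, 0.0], [3.0, 0.0], [0.0, 5.0], [0.0, 7.0]]

# Two span anchors covering the first and second half of the video.
anchors = [(0.25, 0.5), (0.75, 0.5)]
masks = [span_to_mask(c, w, len(frames)) for c, w in anchors]
moments = [masked_mean(frames, m) for m in masks]
```

Each anchor yields one moment feature, so a video is represented by as many vectors as there are anchors instead of by a dense set of multi-scale clips. In the actual method the anchors are learned, whereas here they are fixed by hand for illustration.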
