Towards Efficient Partially Relevant Video Retrieval with Active Moment Discovering

Peipei Song, Long Zhang, Long Lan, Weidong Chen, Dan Guo, Xun Yang, Meng Wang

TL;DR

This work addresses partially relevant video retrieval (PRVR) for untrimmed videos by shifting from dense clip modeling to active moment discovery. The proposed AMDNet learns span anchors representing semantic moments and uses masked multi-moment attention to produce compact moment-enhanced video representations, enabling robust text-to-video matching via a max-over-moments similarity. It introduces three losses—partially relevant retrieval, moment diversity, and moment relevance—to jointly optimize retrieval and moment discovery in an end-to-end framework. Empirical results on TVR and ActivityNet Captions show AMDNet achieving state-of-the-art performance with substantially improved efficiency, including a lightweight parameter footprint and faster retrieval, highlighting its practical applicability for large-scale video search.

Abstract

Partially relevant video retrieval (PRVR) is a practical yet challenging task in text-to-video retrieval, where videos are untrimmed and contain much background content. The goal is a solution that is both effective and efficient at capturing the partial correspondence between text queries and untrimmed videos. Existing PRVR methods typically focus on modeling multi-scale clip representations, but they suffer from content independence and information redundancy, which impairs retrieval performance. To overcome these limitations, we propose a simple yet effective approach with active moment discovering (AMDNet). We are committed to discovering video moments that are semantically consistent with their queries. By using learnable span anchors to capture distinct moments and applying masked multi-moment attention to emphasize salient moments while suppressing redundant backgrounds, we achieve more compact and informative video representations. To further enhance moment modeling, we introduce a moment diversity loss to encourage different moments to cover distinct regions and a moment relevance loss to promote semantically query-relevant moments; both cooperate with a partially relevant retrieval loss for end-to-end optimization. Extensive experiments on two large-scale video datasets (i.e., TVR and ActivityNet Captions) demonstrate the superiority and efficiency of our AMDNet. In particular, AMDNet is about 15.5 times smaller (#parameters) while 6.0 points higher (SumR) than the up-to-date method GMMFormer on TVR.
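
The matching step can be illustrated with a minimal sketch: a video is scored by its best-matching moment, so one relevant moment is enough to rank an otherwise irrelevant video highly. Everything below is an assumption made for illustration rather than the authors' implementation: the function names are hypothetical, cosine similarity stands in for whatever similarity the paper uses, and each moment is reduced to a single vector.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def max_over_moments(query, moments):
    """Score a video as the highest query-moment similarity (hypothetical helper)."""
    return max(cosine(query, m) for m in moments)


def rank_videos(query, videos):
    """Return video ids sorted by max-over-moments score, best first."""
    scores = {vid: max_over_moments(query, moments) for vid, moments in videos.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: video "a" has one moment aligned with the query, video "b" does not.
query = [1.0, 0.0]
videos = {
    "a": [[0.0, 1.0], [1.0, 0.0]],
    "b": [[0.0, 1.0], [1.0, 1.0]],
}
ranking = rank_videos(query, videos)
```

Taking the maximum rather than the mean means background moments in an untrimmed video do not dilute the score of the one moment that matches the query.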

Paper Structure

This paper contains 33 sections, 9 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Comparison of existing PRVR methods (a) and our method (b). Unlike previous dense clip modeling with content independence and information redundancy, we focus on discovering compact moments in untrimmed videos with learnable moment spans.
  • Figure 2: An overview of our proposed AMDNet. Given an untrimmed video and query input, we first extract their features ${\bf V}$ and ${\bf q}$. Then, we predict the center and width anchors $[{\bf c},{\bf w}]$ and convert them into a mask matrix ${\bf M}$. ${\bf M}$ is used to modulate the video encodings via masked multi-moment attention and give the moment-enhanced video representations ${\bf V}^g$. Finally, the text-video similarity is obtained by max-pooling the similarity relations between ${\bf V}^g$ and ${\bf q}$. The model is jointly optimized with multi-task losses, including a partially relevant retrieval loss, a moment diversity loss, and a moment relevance loss.
  • Figure 3: Illustration of masked multi-moment attention. It updates the video clip features ${\bf V}$ to moment-enhanced features ${\bf V}^g$ under the guidance of moment mask ${\bf M}$. $H$ is the number of moment proposals in a video.
  • Figure 4: Performance on different types of queries. Queries are grouped according to their moment-to-video ratios (M/V). The smaller M/V indicates more challenging queries.
  • Figure 5: The performance (i.e., SumR), FLOPs, and # of trainable parameters for various PRVR models on the TVR dataset. The center of the bubble indicates the value of SumR. The diameter of the bubble or star is proportional to the #parameters (M) while the horizontal axis indicates the FLOPs (G).
  • ...and 5 more figures
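
The captions of Figures 2 and 3 describe converting center and width anchors $[{\bf c},{\bf w}]$ into a mask matrix ${\bf M}$ that guides attention over clips. The sketch below shows one way such a conversion could look. It rests on several assumptions: centers and widths are normalized to $[0, 1]$, the mask is hard and binary (the paper may well use a soft, differentiable form), and the function names are hypothetical.

```python
import math


def spans_to_mask(centers, widths, num_clips):
    """Build an H x L binary mask; row h marks the clips covered by span h.

    Assumes centers and widths are normalized to [0, 1] (an illustrative choice).
    """
    mask = []
    for c, w in zip(centers, widths):
        start, end = c - w / 2.0, c + w / 2.0
        row = []
        for i in range(num_clips):
            pos = (i + 0.5) / num_clips  # normalized midpoint of clip i
            row.append(1 if start <= pos <= end else 0)
        mask.append(row)
    return mask


def masked_softmax(scores, mask_row):
    """Softmax over the positions kept by mask_row; masked positions get weight 0."""
    kept = [s for s, m in zip(scores, mask_row) if m]
    if not kept:
        return [0.0] * len(scores)
    top = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


# Two moment proposals (H = 2) over a video of four clips (L = 4).
M = spans_to_mask(centers=[0.25, 0.75], widths=[0.5, 0.5], num_clips=4)
weights = masked_softmax([1.0, 1.0, 5.0, 5.0], M[0])
```

In this toy case the first span covers only the first two clips, so the high scores of the last two clips receive no attention weight for that moment.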