A Flexible and Scalable Framework for Video Moment Search
Chongzhi Zhang, Xizhou Zhu, Aixin Sun
TL;DR
This work tackles Ranked Video Moment Retrieval (RVMR) by introducing SPR, a three-stage, fixed-segment framework that splits videos into uniform units, indexes segment embeddings offline, and uses a coarse-to-fine pipeline (segment retrieval → coarse proposal generation → refinement with re-ranking) to retrieve a ranked list of moments. By projecting text and segment features into a shared space and applying Approximate Nearest Neighbor (ANN) search via Faiss, SPR achieves near real-time inference over large corpora and handles videos of any length. Evaluated on TVR-Ranking, SPR delivers state-of-the-art $NDCG@K$ while significantly reducing computation and latency; its modular design allows independent improvements to segment retrieval, proposal generation, and refinement/re-ranking, with instantiations based on CLIP or ReLoCLNet architectures. The practical, scalable approach demonstrates strong potential for real-world video moment search applications, including robustness to extraneous data and efficient scalability to larger corpora.
Abstract
Video moment search, the process of finding moments in a video corpus that are relevant to a user's query, is crucial for various applications. Existing solutions, however, often assume a single perfectly matching moment, suffer from inefficient inference, and struggle with hour-long videos. This paper introduces a flexible and scalable framework for retrieving a ranked list of moments that match a text query from a collection of videos of any length, a task termed Ranked Video Moment Retrieval (RVMR). Our framework, called Segment-Proposal-Ranking (SPR), simplifies the search process into three independent stages: segment retrieval, proposal generation, and moment refinement with re-ranking. Specifically, videos are divided into equal-length segments whose embeddings are precomputed and indexed offline, allowing efficient retrieval regardless of video length. For scalable online retrieval, both segments and queries are projected into a shared feature space to enable approximate nearest neighbor (ANN) search. Retrieved segments are then merged into coarse-grained moment proposals. Finally, a refinement and re-ranking module adjusts the timestamps of these proposals and reorders them. Evaluations on the TVR-Ranking dataset demonstrate that our framework achieves state-of-the-art performance with significant reductions in computational cost and processing time. The flexible design also allows each stage to be improved independently, making SPR highly adaptable for large-scale applications.
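The first two stages described above (fixed-length segmentation, segment retrieval in a shared embedding space, and merging retrieved segments into coarse proposals) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the function names, the 4-second segment length, and the adjacency-based merging rule are hypothetical, and exact cosine search in NumPy stands in for the ANN index that the paper builds with Faiss. The refinement and re-ranking stage is not shown.

```python
import numpy as np

SEGMENT_LEN = 4.0  # seconds per segment; a hypothetical value, not taken from the paper


def split_into_segments(duration, seg_len=SEGMENT_LEN):
    """Return (start, end) times for each fixed-length segment of one video."""
    n = int(np.ceil(duration / seg_len))
    return [(i * seg_len, min((i + 1) * seg_len, duration)) for i in range(n)]


def retrieve_segments(query_vec, seg_vecs, top_k):
    """Exact cosine top-k search; a stand-in for an ANN index over segment embeddings."""
    q = query_vec / np.linalg.norm(query_vec)
    s = seg_vecs / np.linalg.norm(seg_vecs, axis=1, keepdims=True)
    scores = s @ q
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]


def merge_into_proposals(hits, seg_meta, seg_len=SEGMENT_LEN):
    """Merge retrieved segments that are adjacent within the same video.

    hits:     list of (segment_index, score) from retrieve_segments
    seg_meta: seg_meta[segment_index] == (video_id, segment_number_in_video)
    Returns (video_id, start, end, score) tuples sorted by score, where a
    proposal's score is the best score among its merged segments (an assumed rule).
    End times are multiples of seg_len and are not clipped to the video duration.
    """
    by_video = {}
    for idx, score in hits:
        vid, seg_no = seg_meta[idx]
        by_video.setdefault(vid, []).append((seg_no, score))

    proposals = []
    for vid, items in by_video.items():
        items.sort()
        run_start, prev, best = items[0][0], items[0][0], items[0][1]
        for seg_no, score in items[1:]:
            if seg_no == prev + 1:
                # Extend the current run of consecutive segments.
                prev, best = seg_no, max(best, score)
            else:
                # Close the current run and start a new one.
                proposals.append((vid, run_start * seg_len, (prev + 1) * seg_len, best))
                run_start, prev, best = seg_no, seg_no, score
        proposals.append((vid, run_start * seg_len, (prev + 1) * seg_len, best))

    proposals.sort(key=lambda p: -p[3])
    return proposals
```

In this sketch the segment embeddings and `seg_meta` would be computed once offline, so the online cost per query is one embedding lookup plus a merge over the top-k hits. Swapping `retrieve_segments` for a real ANN index changes only that function, which mirrors the stage independence the abstract emphasizes.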
