A Flexible and Scalable Framework for Video Moment Search

Chongzhi Zhang; Xizhou Zhu; Aixin Sun

TL;DR

This work tackles Ranked Video Moment Retrieval (RVMR) by introducing SPR, a three-stage, fixed-segment framework that splits videos into uniform units, indexes segment embeddings offline, and uses a coarse-to-fine pipeline (segment retrieval → coarse proposal generation → refinement with re-ranking) to retrieve a ranked list of moments. By projecting text and segment features into a shared space and applying Approximate Nearest Neighbor (ANN) search via Faiss, SPR achieves near real-time inference over large corpora and handles videos of any length. Evaluated on TVR-Ranking, SPR delivers state-of-the-art $NDCG@K$ while significantly reducing computation and latency; its modular design allows independent improvements to segment retrieval, proposal generation, and refinement/re-ranking, with instantiations based on CLIP or ReLoCLNet architectures. The practical, scalable approach demonstrates strong potential for real-world video moment search applications, including robustness to extraneous data and efficient scalability to larger corpora.

Abstract

Video moment search, the process of finding relevant moments in a video corpus to match a user's query, is crucial for various applications. Existing solutions, however, often assume a single perfect matching moment, struggle with inefficient inference, and have limitations with hour-long videos. This paper introduces a flexible and scalable framework for retrieving a ranked list of moments from a collection of videos of any length to match a text query, a task termed Ranked Video Moment Retrieval (RVMR). Our framework, called Segment-Proposal-Ranking (SPR), simplifies the search process into three independent stages: segment retrieval, proposal generation, and moment refinement with re-ranking. Specifically, videos are divided into equal-length segments with precomputed embeddings indexed offline, allowing efficient retrieval regardless of video length. For scalable online retrieval, both segments and queries are projected into a shared feature space to enable approximate nearest neighbor (ANN) search. Retrieved segments are then merged into coarse-grained moment proposals. A refinement and re-ranking module then reorders the coarse-grained proposals and adjusts their timestamps. Evaluations on the TVR-Ranking dataset demonstrate that our framework achieves state-of-the-art performance with significant reductions in computational cost and processing time. The flexible design also allows for independent improvements to each stage, making SPR highly adaptable for large-scale applications.
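
As a rough illustration of the first two stages described above, the sketch below splits a video into fixed-length segments and merges retrieved segments into coarse proposals. The 4-second segment length follows the example given in the figure captions; the function names, the rule of merging only adjacent segments, and the max-score aggregation are assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict

SEGMENT_LEN = 4.0  # seconds; assumed value, matching the example in Figure 1


def split_into_segments(duration, segment_len=SEGMENT_LEN):
    """Return (start, end) pairs of non-overlapping, equal-length segments.

    The last segment is clipped to the video duration.
    """
    segments = []
    start = 0.0
    while start < duration:
        segments.append((start, min(start + segment_len, duration)))
        start += segment_len
    return segments


def merge_into_proposals(hits, segment_len=SEGMENT_LEN):
    """Merge retrieved segments into coarse moment proposals.

    hits: list of (video_id, segment_index, score) from segment retrieval.
    Consecutive segments of the same video form one proposal; the proposal
    score is the maximum segment score (a hypothetical aggregation rule).
    Returns (video_id, start_sec, end_sec, score) tuples, best score first.
    """
    by_video = defaultdict(list)
    for video_id, seg_idx, score in hits:
        by_video[video_id].append((seg_idx, score))

    proposals = []
    for video_id, segs in by_video.items():
        segs.sort()
        run_start, run_end, run_score = segs[0][0], segs[0][0], segs[0][1]
        for seg_idx, score in segs[1:]:
            if seg_idx <= run_end + 1:
                # Same or adjacent segment: extend the current run.
                run_end = max(run_end, seg_idx)
                run_score = max(run_score, score)
            else:
                # Gap found: close the current run and start a new one.
                proposals.append((video_id, run_start * segment_len,
                                  (run_end + 1) * segment_len, run_score))
                run_start, run_end, run_score = seg_idx, seg_idx, score
        proposals.append((video_id, run_start * segment_len,
                          (run_end + 1) * segment_len, run_score))

    proposals.sort(key=lambda p: p[3], reverse=True)
    return proposals


# Toy usage with made-up retrieval hits.
example_segments = split_into_segments(10.0)
example_hits = [("v1", 2, 0.9), ("v1", 3, 0.7), ("v1", 7, 0.5), ("v2", 0, 0.8)]
example_proposals = merge_into_proposals(example_hits)
```

Because proposals are built from whole segments, their boundaries always fall on multiples of the segment length, which is why a later refinement stage is needed to recover precise timestamps.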

Paper Structure (26 sections, 2 equations, 4 figures, 7 tables)

Introduction
Related Work
Localizing Moment in Video(s)
Retrieval Frameworks
The SPR Framework
Segment Retrieval
Offline Index Construction
Online Segment Retrieval
Instantiation
Coarse Moment Proposal Generation
Moment Refinement and Re-ranking
Inference Pipeline
Instantiation
Experiments
Datasets and Evaluation Metrics
...and 11 more sections

Figures (4)

Figure 1: The Segment-Proposal-Ranking (SPR) framework. All videos are divided into non-overlapping, equal-length segments (e.g., 4 seconds) for indexing and searching. The final results are computed based on the relevant segments retrieved.
Figure 2: Segment retrieval. With the offline constructed index, the online search/inference takes less than 0.2 seconds to retrieve 100-200 relevant segments for a given query.
Figure 3: Refinement and re-ranking. This module computes precise timestamps of matching moments and re-ranks them by their relevance to the given query.
Figure 4: Visualization of the ground truth moment and the results from SP and SPR for two example queries from the TVR-Ranking dataset. SP generates coarse proposals by aggregating retrieved relevant segments, while SPR refines and re-ranks these proposals. Although SP retrieves highly relevant moments, its timestamps are constrained by the pre-defined segment length (e.g., 4 seconds in our setting). Hence, all proposals from SP have a length that is a multiple of 4 seconds. In contrast, SPR identifies more precise timestamps and ranks relevant moments more effectively.
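
The split between offline index construction and online segment retrieval (Figure 2) can be sketched in a few lines. The paper uses approximate nearest neighbor search via Faiss; the example below substitutes an exhaustive cosine-similarity scan so that it runs with the standard library alone. The embeddings are toy vectors, and `build_index` and `search` are hypothetical helper names rather than part of any library.

```python
import math


def normalize(vec):
    """L2-normalize a vector so that dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_index(segment_embeddings):
    """Offline step: store normalized embeddings keyed by (video_id, segment_index)."""
    return [(key, normalize(vec)) for key, vec in segment_embeddings.items()]


def search(index, query_embedding, top_k=3):
    """Online step: score every indexed segment against the query.

    This is an exact scan standing in for ANN search; an ANN index would
    return approximately the same top-k at much lower cost on large corpora.
    """
    q = normalize(query_embedding)
    scored = [(key, sum(a * b for a, b in zip(q, vec))) for key, vec in index]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy usage: three segments and one query, all in an assumed shared 2-D space.
toy_embeddings = {
    ("v1", 0): [1.0, 0.0],
    ("v1", 1): [0.0, 1.0],
    ("v2", 0): [1.0, 1.0],
}
toy_index = build_index(toy_embeddings)
toy_results = search(toy_index, [2.0, 0.1], top_k=2)
```

The returned `(video_id, segment_index)` keys with their scores are the kind of input a proposal-generation stage would consume.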
