A Lightweight Moment Retrieval System with Global Re-Ranking and Robust Adaptive Bidirectional Temporal Search
Tinh-Anh Nguyen-Nhu, Huu-Loc Tran, Nguyen-Khang Le, Minh-Nhat Nguyen, Tien-Huy Nguyen, Hoang-Long Nguyen-Huu, Huu-Phong Phan-Nguyen, Huy-Thach Pham, Quan Nguyen, Hoang M. Le, Quang-Vinh Dinh
TL;DR
This paper tackles the challenge of efficiently retrieving moment-level information from long, untrimmed video corpora by introducing GRAB, an Interactive Video Corpus Moment Retrieval framework. It combines shot-based keyframe preprocessing with perceptual-hash deduplication, FAISS-based fast retrieval, and a novel SuperGlobal reranking mechanism to improve semantic ranking while reducing memory and compute. A core contribution is Adaptive Bidirectional Temporal Search (ABTS), which jointly optimizes semantic similarity and temporal stability to pinpoint precise start and end times. The system is validated on Known-Item Search and Video QA tasks. The approach demonstrates strong localization accuracy, scalable storage requirements, and interactive capabilities, making it well-suited for large video repositories and practical search applications.
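To make the idea of a bidirectional temporal search concrete, here is a minimal sketch, not the authors' ABTS algorithm. It assumes per-frame query similarity scores are already available, starts from the best-matching frame (the pivot), and expands left and right while frames remain both similar to the pivot and stable relative to their neighbor. The function name `bidirectional_search` and the parameters `alpha` and `max_drop` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def bidirectional_search(
    sims: List[float],
    pivot: int,
    alpha: float = 0.7,      # assumed: minimum fraction of the pivot score to keep a frame
    max_drop: float = 0.15,  # assumed: largest allowed jump between neighboring frames
) -> Tuple[int, int]:
    """Expand outward from `pivot` and return inclusive (start, end) frame indices."""
    floor = alpha * sims[pivot]

    def expand(step: int) -> int:
        idx = pivot
        while 0 <= idx + step < len(sims):
            nxt = idx + step
            too_dissimilar = sims[nxt] < floor                 # semantic similarity check
            unstable = abs(sims[nxt] - sims[idx]) > max_drop   # temporal stability check
            if too_dissimilar or unstable:
                break
            idx = nxt
        return idx

    return expand(-1), expand(+1)


# Toy similarity curve: the relevant moment spans frames 2..5.
sims = [0.10, 0.15, 0.62, 0.70, 0.74, 0.68, 0.30, 0.12]
pivot = max(range(len(sims)), key=sims.__getitem__)
start, end = bidirectional_search(sims, pivot)
print(start, end)
```

Dividing the returned frame indices by the keyframe sampling rate would give start and end timestamps. The two thresholds trade boundary tightness against robustness to noisy scores.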
Abstract
The exponential growth of digital video content has posed critical challenges in moment-level video retrieval, where existing methodologies struggle to efficiently localize specific segments within an expansive video corpus. Current retrieval systems are constrained by computational inefficiencies, temporal context limitations, and the intrinsic complexity of navigating video content. In this paper, we address these limitations through a novel Interactive Video Corpus Moment Retrieval framework that integrates a SuperGlobal Reranking mechanism and Adaptive Bidirectional Temporal Search (ABTS), strategically optimizing query similarity, temporal stability, and computational resources. By preprocessing a large corpus of videos with a keyframe extraction model and an image-hashing deduplication technique, our approach provides a scalable solution that significantly reduces storage requirements while maintaining high localization precision across diverse video repositories.
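The deduplication step can be illustrated with a small sketch. The text above does not say which image hash is used, so this example assumes a simple average hash over tiny grayscale grids and drops any frame whose hash lies within a Hamming-distance threshold of an already kept frame. The helper names and the `max_distance` parameter are hypothetical. Real keyframes would first be downscaled, for example to 8x8 pixels.

```python
from typing import List

Frame = List[List[int]]  # grayscale pixel grid, values 0..255


def average_hash(pixels: Frame) -> int:
    """Set one bit per pixel: 1 if the pixel is brighter than the frame mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def deduplicate(frames: List[Frame], max_distance: int = 1) -> List[int]:
    """Return indices of frames kept after dropping near-duplicates."""
    kept_idx: List[int] = []
    kept_hashes: List[int] = []
    for i, frame in enumerate(frames):
        h = average_hash(frame)
        # Keep the frame only if it differs enough from every frame kept so far.
        if all(hamming(h, k) > max_distance for k in kept_hashes):
            kept_idx.append(i)
            kept_hashes.append(h)
    return kept_idx


frame_a = [[10, 200], [220, 15]]
frame_b = [[12, 198], [225, 11]]  # near-duplicate of frame_a
frame_c = [[200, 10], [15, 220]]  # visually different pattern
kept = deduplicate([frame_a, frame_b, frame_c])
print(kept)
```

Only the kept keyframes would then need to be embedded and indexed, which is where the storage savings come from.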
