A Lightweight Moment Retrieval System with Global Re-Ranking and Robust Adaptive Bidirectional Temporal Search

Tinh-Anh Nguyen-Nhu, Huu-Loc Tran, Nguyen-Khang Le, Minh-Nhat Nguyen, Tien-Huy Nguyen, Hoang-Long Nguyen-Huu, Huu-Phong Phan-Nguyen, Huy-Thach Pham, Quan Nguyen, Hoang M. Le, Quang-Vinh Dinh

TL;DR

This paper tackles efficient moment-level retrieval from long, untrimmed video corpora by introducing GRAB (Global Re-ranking and Adaptive Bidirectional search), an interactive video corpus moment retrieval framework. It combines shot-based keyframe preprocessing with perceptual-hash deduplication, FAISS-based fast retrieval, and a SuperGlobal reranking mechanism to improve semantic ranking while reducing memory and compute. A core contribution is Adaptive Bidirectional Temporal Search (ABTS), which jointly optimizes semantic similarity and temporal stability to pinpoint the start and end times of a moment, and which is validated on Known-Item Search and Video QA tasks. The approach demonstrates strong localization accuracy, scalable storage requirements, and interactive capabilities, making it well suited to large video repositories and practical search applications.
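
The first retrieval stage mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a brute-force cosine-similarity search in pure Python as a stand-in for the FAISS index, and the function names (`normalize`, `top_k`) and the toy 3-dimensional embeddings are assumptions made for illustration only. SuperGlobal reranking is not covered here.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def top_k(query, keyframes, k):
    """Return (index, score) pairs for the k keyframes most similar to the query.

    Hypothetical helper: an exhaustive search standing in for an
    inner-product vector index over normalized embeddings.
    """
    q = normalize(query)
    scored = []
    for idx, emb in enumerate(keyframes):
        e = normalize(emb)
        scored.append((idx, sum(a * b for a, b in zip(q, e))))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy keyframe embeddings (assumed values, not from the paper).
keyframe_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_embedding = [0.8, 0.6, 0.0]

results = top_k(query_embedding, keyframe_embeddings, k=2)
for idx, score in results:
    print(f"keyframe {idx}: cosine similarity {score:.3f}")
```

In a real system the candidates returned by this stage would then be passed to the reranker, and the user would pick a pivot frame from the reranked list.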

Abstract

The exponential growth of digital video content has posed critical challenges in moment-level video retrieval, where existing methodologies struggle to efficiently localize specific segments within an expansive video corpus. Current retrieval systems are constrained by computational inefficiencies, temporal context limitations, and the intrinsic complexity of navigating video content. In this paper, we address these limitations through a novel Interactive Video Corpus Moment Retrieval framework that integrates a SuperGlobal Reranking mechanism and Adaptive Bidirectional Temporal Search (ABTS), strategically optimizing query similarity, temporal stability, and computational resources. By preprocessing a large corpus of videos using a keyframe extraction model and deduplication technique through image hashing, our approach provides a scalable solution that significantly reduces storage requirements while maintaining high localization precision across diverse video repositories.
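
To make the deduplication step concrete, here is a small sketch of hash-based near-duplicate removal. The abstract does not say which image hash is used, so this example assumes a simple average hash compared by Hamming distance; the helper names, the `max_distance` threshold, and the tiny 2x2 grayscale "frames" are all illustrative assumptions.

```python
def average_hash(gray):
    """Return a bit tuple: 1 where a pixel is above the image mean, else 0.

    `gray` is a small grayscale image given as a list of rows. A real
    pipeline would first downscale each keyframe (for example to 8x8).
    """
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    return tuple(1 if p > mean else 0 for p in pixels)

def hamming(a, b):
    """Count the positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def deduplicate(frames, max_distance):
    """Keep a frame only if its hash differs from every kept hash
    by more than `max_distance` bits. Returns the kept frame indices."""
    kept_indices, kept_hashes = [], []
    for idx, frame in enumerate(frames):
        h = average_hash(frame)
        if all(hamming(h, k) > max_distance for k in kept_hashes):
            kept_indices.append(idx)
            kept_hashes.append(h)
    return kept_indices

# Toy frames (assumed values): frame 1 is a slightly noisy copy of frame 0.
frames = [
    [[10, 200], [10, 200]],   # bright right column
    [[12, 198], [11, 205]],   # near-duplicate of frame 0
    [[200, 10], [200, 10]],   # bright left column
]

kept = deduplicate(frames, max_distance=0)
print(kept)
```

Only the hashes of kept frames need to be compared against, which is what lets this kind of filter shrink the keyframe index before any embeddings are stored.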

Paper Structure

This paper contains 17 sections, 8 equations, 7 figures, 2 algorithms.

Figures (7)

  • Figure 1: Overview of GRAB, our Global Re-ranking and Adaptive Bidirectional search system. The user begins by entering a natural language query to search for semantically relevant keyframes in a preprocessed video corpus. (a) Data preprocessing: raw videos are segmented using shot detection, and representative keyframes are extracted and deduplicated to form a storage-efficient and visually diverse index. (b) Embedding-based searching and reranking: the user query is embedded and compared against the keyframe database using FAISS for fast retrieval, followed by SuperGlobal Reranking to refine the results. The user then selects a pivot frame from the top-ranked results. (c) Adaptive Bidirectional Temporal Search: precise start and end boundaries are identified based on semantic similarity and temporal stability. The interface supports interactive refinement and QA-based boundary validation.
  • Figure 2: User Interface of Our Interactive Video Corpus Moment Retrieval System.
  • Figure 3: Visualization of different interaction components: (a) Moment Exploration, (b) Moment Selection, and (c) QA and Boundary Selection.
  • Figure 4: Demonstration of our reranking function's effectiveness in retrieving the frames that best match the query.
  • Figure 5: Demonstration of our temporal search function's ability to recognize and interpret complex spatial compositions across video sequences. The selected frames reflect accurate alignment with the query’s described layout, capturing foreground, background, and motion cues to support precise moment localization.
  • ...and 2 more figures
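
The boundary search described in the Figure 1 caption can be sketched as follows. This is only an illustration of the general idea (expand left and right from a user-selected pivot while frames stay both relevant to the query and consistent with their neighbor), not the paper's ABTS algorithm. The scoring rule, the weight `alpha`, the fixed `threshold`, and the toy 2-dimensional embeddings are all assumptions; in particular, whatever adaptive rule the paper uses is not described in this section and is not reproduced here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def expand(frames, query, pivot, step, alpha, threshold):
    """Walk from the pivot in one direction (step is -1 or +1) and
    return the index of the last frame that was accepted."""
    boundary = pivot
    nxt = boundary + step
    while 0 <= nxt < len(frames):
        relevance = cosine(frames[nxt], query)             # match to the query
        stability = cosine(frames[nxt], frames[boundary])  # match to the neighbor
        # Assumed scoring rule: a weighted sum against a fixed threshold.
        if alpha * relevance + (1 - alpha) * stability < threshold:
            break
        boundary = nxt
        nxt += step
    return boundary

def bidirectional_search(frames, query, pivot, alpha=0.5, threshold=0.8):
    """Return (start, end) frame indices of the moment around the pivot."""
    start = expand(frames, query, pivot, -1, alpha, threshold)
    end = expand(frames, query, pivot, +1, alpha, threshold)
    return start, end

# Toy frame embeddings (assumed values): frames 1-3 show the queried scene,
# frame 4 is a scene change, frame 5 is relevant again but disconnected.
frame_embeddings = [
    [0.0, 1.0],
    [0.9, 0.1],
    [1.0, 0.0],    # pivot chosen by the user
    [0.95, 0.05],
    [0.1, 0.9],
    [1.0, 0.0],
]
query_embedding = [1.0, 0.0]

moment = bidirectional_search(frame_embeddings, query_embedding, pivot=2)
print(moment)
```

Because each direction stops at the first frame that fails the test, frame 5 is excluded even though it matches the query: the scene change at frame 4 breaks temporal continuity with the pivot.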