Towards Efficient and Robust Moment Retrieval System: A Unified Framework for Multi-Granularity Models and Temporal Reranking

Huu-Loc Tran, Tinh-Anh Nguyen-Nhu, Huu-Phong Phan-Nguyen, Tien-Huy Nguyen, Nhat-Minh Nguyen-Dich, Anh Dao, Huy-Duc Do, Quan Nguyen, Hoang M. Le, Quang-Vinh Dinh

TL;DR

This work tackles efficient retrieval from long-form videos via an interactive framework. It integrates four innovations: ensemble search combining BEIT-3 and CLIP, storage optimization through keyframe deduplication, a dual-query temporal search for accurate localization, and temporal reranking using neighboring frame context. The data storage pipeline (keyframe selection, feature extraction, deduplication) and the integrated interactive system enable moment retrieval and QA with improved precision and interpretability. Experimental results on known-item search and video QA demonstrate substantial gains in accuracy and efficiency, highlighting the framework's practicality for real-world interactive video retrieval.
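The ensemble idea can be illustrated with a minimal sketch. The summary does not state how the CLIP and BEIT-3 scores are combined, so the min-max normalization, the weighted sum, and the names `min_max`, `ensemble_rank`, and `clip_weight` are all assumptions made for illustration, not the authors' method.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so the two models are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def ensemble_rank(
    clip_scores: Dict[str, float],
    beit3_scores: Dict[str, float],
    clip_weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Fuse per-frame scores with a weighted sum (an assumed fusion rule)."""
    clip_n = min_max(clip_scores)
    beit_n = min_max(beit3_scores)
    frames = set(clip_n) | set(beit_n)
    fused = {
        f: clip_weight * clip_n.get(f, 0.0)
        + (1.0 - clip_weight) * beit_n.get(f, 0.0)
        for f in frames
    }
    # Highest fused score first; ties broken by frame id for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical query-to-frame similarities from each model.
clip_scores = {"f1": 0.30, "f2": 0.20, "f3": 0.10}
beit3_scores = {"f1": 0.40, "f2": 0.80, "f3": 0.60}
ranking = ensemble_rank(clip_scores, beit3_scores)
```

Normalizing before fusing matters because raw similarity scores from different models generally live on different scales; without it, one model could dominate the sum regardless of the weight.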

Abstract

Long-form video understanding presents significant challenges for interactive retrieval systems, as conventional methods struggle to process extensive video content efficiently. Existing approaches often rely on single models, inefficient storage, unstable temporal search, and context-agnostic reranking, limiting their effectiveness. This paper presents a novel framework to enhance interactive video retrieval through four key innovations: (1) an ensemble search strategy that integrates coarse-grained (CLIP) and fine-grained (BEIT-3) models to improve retrieval accuracy, (2) a storage optimization technique that reduces redundancy by selecting representative keyframes via TransNetV2 and deduplication, (3) a temporal search mechanism that localizes video segments using dual queries for start and end points, and (4) a temporal reranking approach that leverages neighboring frame context to stabilize rankings. Evaluated on known-item search and question-answering tasks, our framework demonstrates substantial improvements in retrieval precision, efficiency, and user interpretability, offering a robust solution for real-world interactive video retrieval applications.
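Innovations (3) and (4) can be sketched on a toy sequence of per-keyframe scores. The abstract gives no formulas, so everything specific here is an assumption: the blending rule and its parameters `window` and `alpha`, the summed-score objective, the `max_gap` constraint, and the function names `neighbor_rerank` and `dual_query_segment` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def neighbor_rerank(
    scores: Sequence[float], window: int = 1, alpha: float = 0.6
) -> List[float]:
    """Blend each frame's score with the mean score of its temporal neighbors."""
    out = []
    for i, s in enumerate(scores):
        lo, hi = max(0, i - window), min(len(scores), i + window + 1)
        neighbors = [scores[j] for j in range(lo, hi) if j != i]
        # A frame with no neighbors keeps its own score as context.
        context = sum(neighbors) / len(neighbors) if neighbors else s
        out.append(alpha * s + (1.0 - alpha) * context)
    return out


def dual_query_segment(
    start_scores: Sequence[float],
    end_scores: Sequence[float],
    max_gap: int,
) -> Optional[Tuple[int, int]]:
    """Pick (start, end) with start <= end <= start + max_gap that
    maximizes start_scores[start] + end_scores[end]."""
    best, best_score = None, float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(len(end_scores), s + max_gap + 1)):
            total = s_score + end_scores[e]
            if total > best_score:
                best, best_score = (s, e), total
    return best


# Frame 1 is an isolated spike; frame 4 is supported by its neighbors.
raw = [0.1, 0.9, 0.1, 0.6, 0.7, 0.6]
smoothed = neighbor_rerank(raw)
top_raw = max(range(len(raw)), key=raw.__getitem__)
top_smoothed = max(range(len(smoothed)), key=smoothed.__getitem__)

# Hypothetical scores of each keyframe against the start and end queries.
start_scores = [0.2, 0.9, 0.1, 0.1, 0.1]
end_scores = [0.1, 0.1, 0.2, 0.8, 0.3]
segment = dual_query_segment(start_scores, end_scores, max_gap=3)
```

In this toy case the raw ranking favors the isolated frame 1, while the neighbor-aware ranking favors frame 4, which is one way context can stabilize a ranking. The dual-query search returns the keyframe range (1, 3), pairing the best start match with the best end match that follows it within the allowed gap.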

Paper Structure

This paper contains 22 sections, 1 equation, 7 figures, 4 algorithms.

Figures (7)

  • Figure 1: Overview of our interactive retrieval system. The system ranks the top-k results through a reranking module (Section \ref{subsec:rerank}) before passing them to an ensemble module (Section \ref{subsec:ensemble}) for final selection. The temporal search module (Section \ref{subsec:temporal}) refines the results by identifying the most relevant time segments, ensuring the retrieval aligns with the query's temporal context. The final output consists of the most relevant moments, providing accurate answers based on the keyframe range.
  • Figure 2: Videos are segmented, deduplicated using cosine similarity between embeddings, and stored in the FAISS index.
  • Figure 3: Before and after frame filtering.
  • Figure 4: The UI for the Interactive Retrieval System. The selected start frame is marked with a green border, and the selected end frame with a red border.
  • Figure 5: Experimental results for Known-Item Search. The actual target frame is highlighted with a red box.
  • ...and 2 more figures
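
The deduplication step in Figure 2 can be sketched as a greedy filter over keyframe embeddings. This is a minimal illustration under stated assumptions: the threshold value of 0.95, the greedy comparison against all kept frames, and the names `cosine` and `deduplicate` are hypothetical, and the FAISS indexing step is omitted.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.95
) -> List[int]:
    """Return indices of keyframes to keep.

    A keyframe is kept only if its similarity to every already-kept
    keyframe is below `threshold` (an assumed value, not from the paper).
    """
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy 2-D embeddings: frames 0 and 1 are near-duplicates, as are 2 and 3.
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
kept = deduplicate(embeddings)
```

Here frames 1 and 3 are dropped as near-duplicates of frames 0 and 2. The greedy comparison is quadratic in the number of kept frames, so a real pipeline would likely restrict it to frames within the same shot or to a local window.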