Multimodal Contextualized Support for Enhancing Video Retrieval System
Quoc-Bao Nguyen-Le, Thanh-Huy Le-Nguyen
TL;DR
The paper addresses the limitation of video retrieval systems that focus on single keyframes by proposing a multimodal pipeline that aggregates information across sequences of frames, augmented with audio context to extract high-level semantics. It integrates frame deduplication (Dinov2), vision-language alignment (Nomic, Uform), text descriptions from LLMs (Vintern, Phi3), prompted object detection (YOLOv8), and video-level representations (ViClipB16/ViClipL14, VideoIntern) plus audio-aware abstraction (Phi-35 with Whisper-derived summaries). A Faiss-based vector search with multi-model score aggregation and a user-facing frontend enables efficient clip-level retrieval, result verification, and easy addition of selected frames to a dataset. The approach demonstrates improved semantic understanding and retrieval performance for complex, clip-based queries, with practical implications for scalable, multimodal video search interfaces and future UI enhancements.
Abstract
Current video retrieval systems, especially those used in competitions, primarily focus on querying individual keyframes or images rather than encoding an entire clip or video segment. However, queries often describe an action or event over a series of frames, not a specific image. This results in insufficient information when analyzing a single frame, leading to less accurate query results. Moreover, extracting embeddings solely from images (keyframes) does not provide enough information for models to encode higher-level, more abstract insights inferred from the video. These models tend to only describe the objects present in the frame, lacking a deeper understanding. In this work, we propose a system that integrates the latest methodologies, introducing a novel pipeline that extracts multimodal data, and incorporate information from multiple frames within a video, enabling the model to abstract higher-level information that captures latent meanings, focusing on what can be inferred from the video clip, rather than just focusing on object detection in one single image.
