Multimodal Contextualized Support for Enhancing Video Retrieval System

Quoc-Bao Nguyen-Le, Thanh-Huy Le-Nguyen

TL;DR

The paper addresses the limitation of video retrieval systems that focus on single keyframes by proposing a multimodal pipeline that aggregates information across sequences of frames, augmented with audio context to extract high-level semantics. It integrates frame deduplication (DINOv2), vision-language alignment (Nomic, Uform), text descriptions from LLMs (Vintern, Phi3), prompted object detection (YOLOv8), and video-level representations (ViClipB16/ViClipL14, VideoIntern) plus audio-aware abstraction (Phi-35 with Whisper-derived summaries). A Faiss-based vector search with multi-model score aggregation and a user-facing frontend together enable efficient clip-level retrieval, result verification, and easy addition of selected frames to a dataset. The approach demonstrates improved semantic understanding and retrieval performance for complex, clip-based queries, with practical implications for scalable, multimodal video search interfaces and future UI enhancements.
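The multi-model score aggregation mentioned above can be illustrated with a small sketch. The summary does not state which aggregation rule the system uses, so the min-max normalization and weighted sum below are assumptions, as are the function names, the model keys, and the frame identifiers. In a real pipeline the per-model scores would come from a vector index such as Faiss; here they are toy dictionaries.

```python
from collections import defaultdict


def min_max_normalize(scores):
    """Rescale one model's scores to [0, 1]; constant inputs map to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {frame_id: 0.0 for frame_id in scores}
    return {frame_id: (s - lo) / (hi - lo) for frame_id, s in scores.items()}


def aggregate_scores(per_model_scores, weights=None):
    """Combine per-model {frame_id: score} dicts into one ranked list.

    Assumption: each model contributes a weighted, min-max normalized score,
    and a frame missing from a model's results contributes nothing for it.
    """
    weights = weights or {}
    total = defaultdict(float)
    for model, scores in per_model_scores.items():
        w = weights.get(model, 1.0)
        for frame_id, s in min_max_normalize(scores).items():
            total[frame_id] += w * s
    return sorted(total.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical similarity scores returned by two retrieval models.
per_model_scores = {
    "nomic": {"v1_f010": 0.82, "v1_f011": 0.80, "v2_f300": 0.40},
    "viclip": {"v1_f010": 0.31, "v2_f300": 0.35, "v3_f050": 0.10},
}
ranked = aggregate_scores(per_model_scores)
for frame_id, score in ranked:
    print(f"{frame_id}: {score:.3f}")
```

Normalizing before summing keeps a model with a wider raw score range from dominating the combined ranking; other rules, such as rank-based fusion, would fit the same interface.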

Abstract

Current video retrieval systems, especially those used in competitions, primarily focus on querying individual keyframes or images rather than encoding an entire clip or video segment. However, queries often describe an action or event over a series of frames, not a specific image. This results in insufficient information when analyzing a single frame, leading to less accurate query results. Moreover, extracting embeddings solely from images (keyframes) does not provide enough information for models to encode higher-level, more abstract insights inferred from the video. These models tend to only describe the objects present in the frame, lacking a deeper understanding. In this work, we propose a system that integrates the latest methodologies through a novel pipeline that extracts multimodal data and incorporates information from multiple frames within a video. This enables the model to abstract higher-level information that captures latent meanings, focusing on what can be inferred from the video clip rather than on object detection in a single image.

Paper Structure

This paper contains 15 sections, 5 equations, and 4 figures.

Figures (4)

  • Figure 1: The architecture of the CLIP model is shown on the left. However, Nomic Vision outperforms CLIP across all benchmarks. Our experience also confirms that Nomic is significantly superior to CLIP in visual-language retrieval tasks.
  • Figure 2: Although all the input texts describe a context involving a dog, ViClipB16 accurately interprets the sequence of frames and assigns the highest score to the first text, which also matches the query.
  • Figure 3: Phi-35, enhanced with contextualized audio summaries, can grasp the high-level concepts behind a clip, making it well-suited for complex and abstract queries.
  • Figure 4: Query: "The video is presented through a series of consecutive colored drawings. The content of the drawings depicts a trial in court. There is an American flag in one of the drawings". In this example, we select the Nomic method from among the available query options, and the correct result appears ranked first. When the user clicks on any frame, a modal displays a list of preceding and following frames. On the right side, the video plays at the corresponding timestamp, allowing the user to verify the results. Users can click "choose this", and the selected frames are automatically registered to the database and displayed in the submission manager panel on the right.
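The verification step described in the Figure 4 caption, where clicking a frame shows its neighbours and seeks the video to the matching timestamp, can be sketched as follows. This is only an illustration under stated assumptions: the function names, the window sizes, and the 25 fps frame rate are hypothetical and are not taken from the paper.

```python
def context_window(frame_ids, selected, before=3, after=3):
    """Return the frames around `selected`, clipped to the list boundaries."""
    i = frame_ids.index(selected)
    start = max(0, i - before)
    end = min(len(frame_ids), i + after + 1)
    return frame_ids[start:end]


def frame_timestamp(frame_number, fps=25.0):
    """Convert a frame number to seconds (fps is an assumed value)."""
    return frame_number / fps


# Hypothetical keyframe numbers extracted from one video.
frames = [0, 25, 50, 75, 100, 125, 150]

window = context_window(frames, 50, before=2, after=2)
edge = context_window(frames, 0, before=2, after=2)
seek = frame_timestamp(50)

print(window)  # neighbours shown in the modal
print(edge)    # window is shortened at the start of the video
print(seek)    # position, in seconds, to seek the player to
```

Clipping the window at the list boundaries keeps the modal usable for frames at the very start or end of a video, where fewer neighbours exist.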