Multimodal Lengthy Videos Retrieval Framework and Evaluation Metric

Mohamed Eltahir; Osamah Sarraj; Mohammed Bremoo; Mohammed Khurd; Abdulrahman Alfrihidi; Taha Alshatiri; Mohammad Almatrafi; Tanveer Hussain

TL;DR

This work tackles zero-shot long-video retrieval with a unified multimodal framework that segments the input video using its subtitles and processes the visual and aural streams independently before a final intersection-based alignment. The visual stream embeds the clips $C_i$ and the user query $Q_{user}$ with a video CLIP model and retrieves the top-$K$ clips. The aural stream encodes the subtitles $S$ and the query to produce its own top-$K$ set, which is enriched by a lexical heuristic and re-ranked by an LLM to form $S^*$. A final intersection over shared timestamp intervals combines the two streams, so the framework does not rely on predefined segmentation. The system is evaluated on YouCook2 with a novel long-video metric, Average Recall@K across intersection thresholds. VideoCLIP-XL achieves the best performance among the tested models, indicating strong vision-text alignment in long-form video retrieval and leaving room for improvement in segmentation and candidate generation. The framework is applicable zero-shot and offers a benchmark direction for long-video evaluation metrics.
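The final intersection step can be illustrated with a small sketch. The summary above does not specify how the intersection is computed, so this assumes the simplest reading: each stream returns candidate `(start, end)` intervals in seconds, and the result keeps the overlapping portion of every visual/aural pair. The function names and sample intervals are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds); assumed representation


def overlap(a: Interval, b: Interval) -> float:
    """Length in seconds of the overlap between two intervals (0.0 if disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def intersect_streams(visual: List[Interval], aural: List[Interval]) -> List[Interval]:
    """Keep the shared portion of every overlapping visual/aural candidate pair."""
    shared = set()
    for v in visual:
        for a in aural:
            if overlap(v, a) > 0:
                shared.add((max(v[0], a[0]), min(v[1], a[1])))
    return sorted(shared)


# Hypothetical top-K candidates from each stream.
visual_top_k = [(0.0, 12.0), (30.0, 45.0), (80.0, 95.0)]
aural_top_k = [(10.0, 20.0), (40.0, 50.0)]
print(intersect_streams(visual_top_k, aural_top_k))
```

Only intervals supported by both streams survive, which is one plausible way the visual and aural evidence could be made to agree on a moment in the video.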

Abstract

Precise video retrieval requires multimodal correlations to handle unseen vocabulary and scenes. The task becomes more complex for lengthy videos, where models must perform effectively without prior training on a specific dataset. We introduce a unified framework that combines a visual matching stream and an aural matching stream with a subtitle-based video segmentation approach. The aural stream also includes a complementary audio-based two-stage retrieval mechanism that improves performance on long-duration videos. Because retrieval from lengthy videos is difficult to evaluate, we introduce a new evaluation method designed specifically for long-video retrieval to support further research. Experiments on the YouCook2 benchmark show promising retrieval performance.
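To make the evaluation idea concrete, here is a minimal sketch of an Average Recall@K computed across intersection thresholds. The exact definition used in the paper is not given on this page, so the sketch assumes that "intersection" means temporal intersection-over-union (IoU) between a predicted and a ground-truth interval, that a query counts as a hit when any of its top-$K$ predictions reaches the threshold, and that the score is the mean recall over thresholds. The threshold values and sample data are hypothetical.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds); assumed representation


def temporal_iou(a: Interval, b: Interval) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(preds: List[List[Interval]], truths: List[Interval],
                k: int, threshold: float) -> float:
    """Fraction of queries whose top-k predictions contain a hit at this threshold."""
    hits = sum(
        any(temporal_iou(p, gt) >= threshold for p in ranked[:k])
        for ranked, gt in zip(preds, truths)
    )
    return hits / len(truths)


def average_recall_at_k(preds: List[List[Interval]], truths: List[Interval],
                        k: int, thresholds: Sequence[float]) -> float:
    """Mean of Recall@k over the given IoU thresholds (assumed definition)."""
    return sum(recall_at_k(preds, truths, k, t) for t in thresholds) / len(thresholds)


# Hypothetical ranked predictions for two queries and their ground-truth intervals.
predictions = [[(0.0, 10.0), (20.0, 30.0)], [(50.0, 60.0)]]
ground_truth = [(0.0, 10.0), (40.0, 60.0)]
score = average_recall_at_k(predictions, ground_truth, k=2, thresholds=(0.3, 0.5, 0.7))
print(round(score, 4))
```

Averaging over several thresholds rewards predictions that localize the target moment tightly, rather than crediting any prediction that merely touches it.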
