Multimodal Lengthy Videos Retrieval Framework and Evaluation Metric
Mohamed Eltahir, Osamah Sarraj, Mohammed Bremoo, Mohammed Khurd, Abdulrahman Alfrihidi, Taha Alshatiri, Mohammad Almatrafi, Tanveer Hussain
TL;DR
This work tackles zero-shot retrieval from long videos with a unified multimodal framework that segments the input via subtitles and processes the visual and aural streams independently before a final intersection-based alignment. The visual stream uses video CLIP embeddings of the clips $C_i$ and the user query $Q_{user}$ to retrieve the top-$K$ clips. The aural stream encodes the subtitles $S$ and the query to produce its own top-$K$ set, which is enriched by a lexical heuristic and re-ranked by an LLM to form $S^*$. A final intersection over shared timestamp intervals combines the two streams, avoiding reliance on predefined segmentation and supporting robust long-video understanding. The system is evaluated on YouCook2 with a novel long-video metric, Average Recall@K across intersection thresholds. VideoCLIP-XL achieves the best performance among the tested models, indicating strong vision-text alignment in long-form video retrieval and leaving room for further improvements in segmentation and candidate generation. The framework is applicable in zero-shot settings and offers a benchmark direction for long-video evaluation metrics.
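The final intersection step can be illustrated with a minimal sketch. It assumes each stream returns its candidates as `(start_sec, end_sec)` intervals and that two candidates are merged whenever they overlap in time. The function names, the `min_overlap` parameter, and the pairwise rule are illustrative assumptions, not the authors' exact procedure.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec); an assumed representation


def overlap(a: Interval, b: Interval) -> float:
    """Length in seconds of the overlap between two intervals (0 if disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def intersect_streams(
    visual: List[Interval],
    aural: List[Interval],
    min_overlap: float = 0.0,  # hypothetical knob: required overlap in seconds
) -> List[Interval]:
    """Return the shared span of every visual/aural candidate pair that overlaps."""
    shared = []
    for v in visual:
        for a in aural:
            if overlap(v, a) > min_overlap:
                # Keep only the time span both streams agree on.
                shared.append((max(v[0], a[0]), min(v[1], a[1])))
    return shared


# Toy candidates: only the first visual clip and first subtitle segment overlap.
visual_top_k = [(10, 25), (40, 55)]
aural_top_k = [(20, 30), (100, 110)]
result = intersect_streams(visual_top_k, aural_top_k)
print(result)
```

In this toy case only the span from 20 s to 25 s is supported by both streams, so it is the single interval returned.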
Abstract
Precise video retrieval requires multimodal correlations to handle unseen vocabulary and scenes, and it becomes more complex for lengthy videos, where models must perform effectively without prior training on a specific dataset. We introduce a unified framework that combines a visual matching stream and an aural matching stream with a subtitle-based video segmentation approach. Additionally, the aural stream includes a complementary audio-based two-stage retrieval mechanism that enhances performance on long-duration videos. Because retrieval from lengthy videos and its evaluation are both complex, we introduce a new retrieval evaluation method specifically designed for long-video retrieval to support further research. Experiments on the YouCook2 benchmark show promising retrieval performance.
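One way to read the evaluation idea summarized above (Average Recall@K across intersection thresholds) is sketched below. It assumes a prediction counts as a hit when its temporal intersection-over-union (IoU) with the ground-truth interval reaches a threshold, and that the final score is the mean Recall@K over several thresholds. The use of IoU, the threshold values, and all function names are assumptions for illustration and may differ from the paper's definition.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec); an assumed representation


def iou(a: Interval, b: Interval) -> float:
    """Temporal intersection-over-union of two intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(
    ranked: List[List[Interval]],  # per query: predictions, best first
    truths: List[Interval],        # per query: one ground-truth interval
    k: int,
    threshold: float,
) -> float:
    """Fraction of queries with a top-k prediction whose IoU meets the threshold."""
    if not truths:
        return 0.0
    hits = sum(
        any(iou(p, gt) >= threshold for p in preds[:k])
        for preds, gt in zip(ranked, truths)
    )
    return hits / len(truths)


def average_recall_at_k(
    ranked: List[List[Interval]],
    truths: List[Interval],
    k: int,
    thresholds: Sequence[float] = (0.1, 0.3, 0.5),  # assumed values
) -> float:
    """Mean of Recall@K over the given intersection thresholds."""
    return sum(recall_at_k(ranked, truths, k, t) for t in thresholds) / len(thresholds)


# Two toy queries: the first is matched by its second-ranked prediction only.
ranked = [[(0, 10), (20, 30)], [(50, 60)]]
truths = [(22, 30), (0, 5)]
print(average_recall_at_k(ranked, truths, k=1))
print(average_recall_at_k(ranked, truths, k=2))
```

Averaging over thresholds rewards predictions that are roughly right on a long timeline without requiring exact boundary agreement, which is harder to achieve when no predefined segmentation exists.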
