Multimodal Lengthy Videos Retrieval Framework and Evaluation Metric

Mohamed Eltahir, Osamah Sarraj, Mohammed Bremoo, Mohammed Khurd, Abdulrahman Alfrihidi, Taha Alshatiri, Mohammad Almatrafi, Tanveer Hussain

TL;DR

This work tackles zero-shot long-video retrieval with a unified multimodal framework that segments each video by its subtitles and processes the visual and aural streams independently before aligning them by intersection. The visual stream embeds each clip $C_i$ and the user query $Q_{user}$ with a video CLIP model and retrieves the top-$K$ clips. The aural stream encodes the subtitles $S$ and the query to produce a top-$K$ set, which is enriched by a lexical heuristic and re-ranked by an LLM to form $S^*$. A final intersection over shared timestamp intervals combines the two streams, so the framework does not depend on predefined segmentation. The system is evaluated on YouCook2 with a novel long-video metric, Average Recall@K across intersection thresholds. VideoCLIP-XL performs best among the tested models, indicating strong vision-text alignment for long-form video retrieval, and the results leave room for improvement in segmentation and candidate generation. The framework is applicable zero-shot and offers a benchmark direction for long-video evaluation metrics.
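The final intersection step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `(start, end, score)` tuple layout, the function name `fuse_streams`, and the rule that any positive-length temporal overlap counts as a match are assumptions made for illustration. Ranking by the mean of the two similarity scores follows the Figure 1 caption.

```python
from typing import List, Tuple

# Hypothetical layout: (start_sec, end_sec, similarity_score)
Hit = Tuple[float, float, float]


def fuse_streams(visual_hits: List[Hit], aural_hits: List[Hit]) -> List[Hit]:
    """Intersect visual and aural candidates on shared time intervals.

    Assumption: a visual clip and a subtitle segment match when their
    intervals overlap for a positive duration; the fused interval is the
    overlap and the fused score is the mean of the two scores.
    """
    fused: List[Hit] = []
    for v_start, v_end, v_score in visual_hits:
        for s_start, s_end, s_score in aural_hits:
            start = max(v_start, s_start)
            end = min(v_end, s_end)
            if start < end:  # keep only positive-length overlaps
                fused.append((start, end, (v_score + s_score) / 2))
    # Highest averaged similarity first
    return sorted(fused, key=lambda hit: hit[2], reverse=True)


if __name__ == "__main__":
    visual = [(110.0, 122.0, 0.75), (30.0, 40.0, 0.9)]
    aural = [(114.0, 121.0, 0.5), (200.0, 210.0, 0.7)]
    print(fuse_streams(visual, aural))
```

In this toy input only the first visual clip and the first subtitle segment share a time span, so a single fused interval survives. Candidates found by only one stream are dropped, which is the behaviour an intersection implies.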

Abstract

Precise video retrieval requires multi-modal correlations to handle unseen vocabulary and scenes, becoming more complex for lengthy videos where models must perform effectively without prior training on a specific dataset. We introduce a unified framework that combines a visual matching stream and an aural matching stream with a unique subtitles-based video segmentation approach. Additionally, the aural stream includes a complementary audio-based two-stage retrieval mechanism that enhances performance on long-duration videos. Considering the complex nature of retrieval from lengthy videos and its corresponding evaluation, we introduce a new retrieval evaluation method specifically designed for long-video retrieval to support further research. We conducted experiments on the YouCook2 benchmark, showing promising retrieval performance.

Paper Structure

This paper contains 7 sections, 2 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: The proposed framework for multi-modal video querying. The visual stream retrieves the top $K$ clips ($C^*$), while the aural stream retrieves the top $K$ subtitles ($S^*$). The final retrieved clips are obtained by intersecting $C^*$ and $S^*$, ranked by averaged similarity scores.
  • Figure 2: Recall@1 scores of the framework across intersection thresholds.
  • Figure 3: Recall@5 scores of the framework across intersection thresholds.
  • Figure 4: Recall@10 scores of the framework across intersection thresholds.
  • Figure 5: Qualitative results for the query: "Add the squid into a pot of hot oil". The ground truth interval is [114, 121]. CLIP4Clip failed to retrieve the correct interval within the top 5 clips. ViCLIP, with its base and large variants, retrieved the correct interval in the 4th and 5th positions, respectively. VideoCLIP-XL managed to retrieve the correct clip as the first result, demonstrating its powerful retrieval performance.
  • ...and 1 more figures
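
Figures 2 to 4 report Recall@K across intersection thresholds. The sketch below shows one way such a metric could be computed. It is a sketch under stated assumptions, not the paper's definition: it assumes the overlap between a retrieved interval and the ground truth is measured by temporal intersection-over-union (IoU), that a query counts as a hit when any of its top-$K$ intervals reaches the threshold, and that the average is an unweighted mean over thresholds. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(a: Interval, b: Interval) -> float:
    """Intersection-over-union of two time intervals (assumed overlap measure)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(
    retrieved: Sequence[List[Interval]],
    ground_truth: Sequence[Interval],
    k: int,
    threshold: float,
) -> float:
    """Fraction of queries whose top-k list holds an interval with IoU >= threshold."""
    hits = 0
    for ranked, gt in zip(retrieved, ground_truth):
        if any(temporal_iou(iv, gt) >= threshold for iv in ranked[:k]):
            hits += 1
    return hits / len(ground_truth)


def average_recall_at_k(
    retrieved: Sequence[List[Interval]],
    ground_truth: Sequence[Interval],
    k: int,
    thresholds: Sequence[float],
) -> float:
    """Unweighted mean of Recall@k over the given intersection thresholds."""
    total = sum(recall_at_k(retrieved, ground_truth, k, t) for t in thresholds)
    return total / len(thresholds)


if __name__ == "__main__":
    # Two toy queries; the first ground truth reuses the [114, 121] interval
    # from the Figure 5 example purely as illustrative data.
    retrieved = [[(0.0, 10.0), (114.0, 121.0)], [(50.0, 60.0)]]
    ground_truth = [(114.0, 121.0), (55.0, 65.0)]
    print(average_recall_at_k(retrieved, ground_truth, k=2, thresholds=[0.3, 0.5]))
```

Sweeping the threshold makes the strictness of a "correct" retrieval explicit: a low threshold rewards any rough temporal match, while a high threshold requires the retrieved interval to coincide closely with the annotated one.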