Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval

Shubhashis Roy Dipta, Francis Ferraro

TL;DR

Q2E presents a zero-shot multilingual text-to-video retrieval framework that enriches user queries by decomposing them into prequel, current, and sequel events and by generating multimodal video descriptions. It combines LLM-driven event decomposition with VLM-based frame/video captioning and a multilingual ASR pipeline, then fuses five similarity signals through inverse-entropy rank fusion to produce robust rankings without fine-tuning. Across MSR-VTT, MSVD, and MultiVENT, Q2E achieves consistent improvements, especially when audio information is available, and demonstrates language-robust retrieval with different encoders. The work highlights how leveraging latent world knowledge in LLMs and VLMs can significantly improve retrieval performance while outlining future work on efficiency, bias mitigation, and extending video-captioning capabilities.

Abstract

Recent approaches have shown impressive proficiency in extracting and leveraging parametric knowledge from Large-Language Models (LLMs) and Vision-Language Models (VLMs). In this work, we consider how we can improve the identification and retrieval of videos related to complex real-world events by automatically extracting latent parametric knowledge about those events. We present Q2E: a Query-to-Event decomposition method for zero-shot multilingual text-to-video retrieval, adaptable across datasets, domains, LLMs, or VLMs. Our approach demonstrates that we can enhance the understanding of otherwise overly simplified human queries by decomposing the query using the knowledge embedded in LLMs and VLMs. We additionally show how to apply our approach to both visual and speech-based inputs. To combine this varied multimodal knowledge, we adopt entropy-based fusion scoring for zero-shot fusion. Through evaluations on two diverse datasets and multiple retrieval metrics, we demonstrate that Q2E outperforms several state-of-the-art baselines. Our evaluation also shows that integrating audio information can significantly improve text-to-video retrieval. We have released code and data for future research.
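The entropy-based fusion mentioned above can be illustrated with a minimal sketch. It is not the authors' released implementation: the softmax normalization, the reciprocal-rank combination, and the names `softmax`, `entropy`, `ranks`, and `inverse_entropy_fusion` are all assumptions made for illustration. The idea shown is only that a similarity signal whose score distribution over videos is peaked (low entropy) receives a larger weight than a flat, uncertain one.

```python
import math

def softmax(scores):
    # Turn raw similarity scores into a probability distribution over videos.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def ranks(scores):
    # 1-based rank of each video, highest score first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    out = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        out[idx] = position
    return out

def inverse_entropy_fusion(signals, eps=1e-9):
    # signals: one score list per similarity signal, all over the same videos.
    # Assumption: weight = 1 / entropy of the softmaxed scores, normalized to sum to 1.
    weights = [1.0 / (entropy(softmax(s)) + eps) for s in signals]
    total = sum(weights)
    weights = [w / total for w in weights]
    fused = [0.0] * len(signals[0])
    for w, s in zip(weights, signals):
        for i, r in enumerate(ranks(s)):
            # Assumption: combine signals through weighted reciprocal ranks.
            fused[i] += w / r
    return fused, weights

# Two toy signals over three videos (the summary above describes five signals).
signals = [
    [5.0, 0.1, 0.2],     # peaked scores: low entropy, high weight
    [0.30, 0.31, 0.29],  # nearly flat scores: high entropy, low weight
]
fused, weights = inverse_entropy_fusion(signals)
best = max(range(len(fused)), key=fused.__getitem__)
```

In this toy example the peaked signal dominates the fused ranking, so the video it prefers ends up first even though the flat signal ranks a different video highest.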

Paper Structure

This paper contains 48 sections, 3 equations, 4 figures, 16 tables.

Figures (4)

  • Figure 1: In Q2E, we extract prequel, current, and sequel events, along with an audio transcript and video description, to enrich the query and video context, respectively. These decomposed queries are matched across visual, textual, and speech-based descriptions (matching phrases are highlighted in the same color), enabling retrieval of the correct videos while effectively filtering out a visually similar but non-relevant video.
  • Figure 2: Q2E, the complete framework of our text-to-video retrieval system. The blue box represents the event decomposition module, while the green box illustrates the video decomposition module. The orange box represents our multi-layer Audio Decomposition module. The purple box fuses the ranks calculated from query-video, query-descriptions, and event-descriptions (event = prequel + current + sequel, description = multimodal description). In the non-ASR variant, the components within the orange box and orange dotted lines are excluded.
  • Figure 3: Comparison of NDCG@10 scores across topic categories (Disasters, Political, Social, Technology) for five languages: Arabic, Chinese, English, Korean, and Russian. The results show the performance improvement from the baseline to our method (with or without audio), highlighting consistent gains across languages and domains.
  • Figure 4: Impact of the number of frames on retrieval performance. The figure plots nDCG against comparisons per second for three methods: MultiCLIP (black), Q2E without ASR (blue), and Q2E with ASR (orange). The point size is proportional to the number of sampled frames (i.e., 2, 4, 8, 16, 32, 64). Comparison per Second = Runtime / (number of queries $\times$ number of videos). Results are reported on the MultiVENT dataset. The full table is reported in \ref{tab:lg_num_of_frames}.
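
The two quantities plotted in Figures 3 and 4 can be sketched as follows. This is an illustrative sketch rather than the paper's evaluation code: the function names are hypothetical, the nDCG variant shown uses the common log2 discount with linear gains (the paper may use a different variant), and the second function simply encodes the ratio given in the Figure 4 caption.

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain over the top-k results (position 0 is rank 1).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10, all_relevances=None):
    # ranked_relevances: relevance labels in the order the system returned them.
    # all_relevances: labels of every candidate for the query; if omitted,
    # the ideal ordering is built from the returned list only (an assumption).
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

def comparison_per_second(runtime_seconds, num_queries, num_videos):
    # Ratio as defined in the Figure 4 caption: runtime / (queries x videos).
    return runtime_seconds / (num_queries * num_videos)

score = ndcg_at_k([1, 0, 1], k=10)  # one relevant video is ranked too low
perfect = ndcg_at_k([1, 1, 0], k=10)
cost = comparison_per_second(120.0, 10, 600)  # hypothetical numbers
```

Passing `all_relevances` matters when relevant videos fall outside the returned list; otherwise the ideal ranking is underestimated and the score is inflated.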