RAPID: Retrieval-Augmented Parallel Inference Drafting for Text-Based Video Event Retrieval

Long Nguyen, Huy Nguyen, Bao Khuu, Huy Luu, Huy Le, Tuan Nguyen, Tho Quan

TL;DR

RAPID addresses the challenge of text-based video event retrieval under context-poor queries by augmenting user input with context via Large Language Models and prompting strategies. It employs a retrieval-augmented parallel inference pipeline that embeds text and keyframes into a shared $d$-dimensional space, uses cosine similarity for parallel top-$k$ retrieval per augmented draft, and re-ranks the final candidates against the original query, with OCR-based filtering and a user-friendly interface for disambiguation. Empirical results on a 300-hour news-video dataset show that location-aware augmentation improves retrieval performance, with CLIP-based embeddings (notably CLIP-ViT-L/14) delivering strong gains and RAPID outperforming the competition baseline in overall MRR. The work demonstrates practical speed and accuracy benefits for context-deficient queries and lays groundwork for incorporating additional modalities and model fine-tuning to further enhance performance in real-world large-scale video archives.
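The retrieve-then-re-rank flow summarized above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the text and keyframe embeddings are already computed (for example by a CLIP-style encoder), that per-draft candidates are merged by a simple union, and it leaves out the LLM augmentation and OCR filtering steps. The names `retrieve_and_rerank`, `k`, and `final_k` are hypothetical and chosen only for illustration.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def retrieve_and_rerank(query_vec, draft_vecs, keyframe_vecs, k, final_k):
    """Top-k retrieval per augmented draft, then re-ranking by the original query.

    query_vec:     (d,)   embedding of the original query
    draft_vecs:    (n, d) embeddings of the augmented drafts
    keyframe_vecs: (m, d) embeddings of the keyframes
    Merging candidates by union is an assumption of this sketch.
    """
    q = l2_normalize(np.asarray(query_vec, dtype=float))
    drafts = l2_normalize(np.asarray(draft_vecs, dtype=float))
    frames = l2_normalize(np.asarray(keyframe_vecs, dtype=float))

    # One matrix product scores every draft against every keyframe: shape (n, m).
    sims = drafts @ frames.T

    # Indices of the k most similar keyframes for each draft (unordered within a row).
    k = max(1, min(k, frames.shape[0]))
    top_idx = np.argpartition(-sims, k - 1, axis=1)[:, :k]

    # Pool the per-draft candidates, dropping duplicates.
    candidates = np.unique(top_idx)

    # Re-rank the pooled candidates by cosine similarity to the original query.
    scores = frames[candidates] @ q
    order = np.argsort(-scores, kind="stable")[:final_k]
    return candidates[order], scores[order]
```

Scoring all drafts in a single matrix product is one simple way to run the per-draft retrievals together; a production system over a large archive would more likely use an approximate nearest-neighbour index, which this sketch does not attempt to model.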

Abstract

Retrieving events from videos using text queries has become increasingly challenging due to the rapid growth of multimedia content. Existing methods for text-based video event retrieval often focus heavily on object-level descriptions, overlooking the crucial role of contextual information. This limitation is especially apparent when queries lack sufficient context, such as missing location details or ambiguous background elements. To address these challenges, we propose a novel system called RAPID (Retrieval-Augmented Parallel Inference Drafting), which leverages advancements in Large Language Models (LLMs) and prompt-based learning to semantically correct and enrich user queries with relevant contextual information. These enriched queries are then processed through parallel retrieval, followed by an evaluation step to select the most relevant results based on their alignment with the original query. Through extensive experiments on our custom-developed dataset, we demonstrate that RAPID significantly outperforms traditional retrieval methods, particularly for contextually incomplete queries. Our system was validated for both speed and accuracy through participation in the Ho Chi Minh City AI Challenge 2024, where it successfully retrieved events from over 300 hours of video. Further evaluation comparing RAPID with the baseline proposed by the competition organizers demonstrated its superior effectiveness, highlighting the strength and robustness of our approach.

Paper Structure

This paper contains 12 sections, 7 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: An illustration of RAPID's UI for the textual query $Q_0$: "A monk is writing", where $n = 4$ augmented queries are selected from $N = 6$ generated drafts, and the parameter $K = 600$ specifies the number of final keyframes. The relevant result, highlighted in green, is displayed among the top-ranked keyframes.
  • Figure 2: Before pressing Submit, the user can review the frames adjacent to the selected keyframe to check for accuracy, and can view the complete video containing it by pressing the YouTube button.