VCR: Video representation for Contextual Retrieval
Oron Nir, Idan Vidra, Avi Neeman, Barak Kinarti, Ariel Shamir
TL;DR
VCR proposes a multimodal, text-based video representation framework for contextual retrieval by fusing ASR, OCR, and frame-caption signals into latent semantic embeddings indexed offline and queried online via GPT-4–generated descriptions. The approach includes a supervised multilabel classifier and a learning-free GPT-based encoder, both yielding a joint semantic space, and introduces the Topics-Map visualization to support ontology-driven exploration. Evaluations on TED, TDT2, and MSR-VTT show strong multimodal performance and near state-of-the-art results without fine-tuning, with OpenAI embeddings achieving near-perfect MRR on TED tasks. The proposed UX combines exploration and exploitation in a 2D semantic map, enabling scalable, semantically informed video discovery with practical applicability across domains, while acknowledging limitations around indexing latency and privacy considerations.
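For reference, mean reciprocal rank (MRR), the metric cited above for the TED tasks, averages over queries the reciprocal of the rank at which the relevant item appears. The following is a minimal sketch of that computation; the function name and the toy rankings are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Hashable, Sequence


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]],
    relevant: Sequence[Hashable],
) -> float:
    """Average of 1/rank of the relevant item per query (0 if it is absent)."""
    if len(rankings) != len(relevant):
        raise ValueError("need exactly one relevant item per ranking")
    if not rankings:
        return 0.0
    total = 0.0
    for ranked, target in zip(rankings, relevant):
        for position, item in enumerate(ranked, start=1):
            if item == target:
                total += 1.0 / position
                break
    return total / len(rankings)


# Toy data (hypothetical video ids): the target is ranked 1st, 2nd, and missing.
toy_rankings = [["v1", "v2"], ["v3", "v1"], ["v2", "v3"]]
toy_targets = ["v1", "v1", "v1"]
score = mean_reciprocal_rank(toy_rankings, toy_targets)  # (1 + 1/2 + 0) / 3
```

An MRR close to 1 therefore means the correct video is ranked first for almost every query.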
Abstract
Streamlining content discovery within media archives requires combining advanced data representations with visualization techniques that communicate video topics clearly to users. The proposed system addresses the challenge of efficiently navigating large video collections by fusing visual, audio, and textual features to index and categorize video content accurately through a text-based method. Additionally, semantic embeddings are employed to provide contextually relevant information and recommendations to users, resulting in an intuitive and engaging exploratory experience over our topics ontology map using OpenAI GPT-4.
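The text-based indexing idea can be sketched as follows: each video's speech transcript, on-screen text, and frame captions are fused into one document, embedded once offline, and compared against an embedded query at search time. This is only an illustration under stated assumptions. The bag-of-words `embed` function is a stand-in for the learned semantic embeddings the system uses, and every name here (`VideoSignals`, `fuse`, `build_index`, `search`) is hypothetical rather than taken from the authors' implementation.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class VideoSignals:
    video_id: str
    asr: str       # speech transcript
    ocr: str       # on-screen text
    captions: str  # frame captions


def fuse(signals: VideoSignals) -> str:
    """Concatenate the three text streams into one document per video."""
    return " ".join([signals.asr, signals.ocr, signals.captions])


def embed(text: str) -> Counter:
    """Stand-in embedder (assumption): a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(videos: list[VideoSignals]) -> dict[str, Counter]:
    """Offline step: embed each fused video document once."""
    return {v.video_id: embed(fuse(v)) for v in videos}


def search(index: dict[str, Counter], query: str, top_k: int = 3) -> list[str]:
    """Online step: rank indexed videos by similarity to the query text."""
    q = embed(query)
    scored = sorted(
        ((cosine(q, vec), vid) for vid, vec in index.items()),
        key=lambda pair: (-pair[0], pair[1]),
    )
    return [vid for _, vid in scored[:top_k]]


# Toy collection (invented for illustration).
videos = [
    VideoSignals(
        "talk_climate",
        asr="rising sea levels threaten coastal cities",
        ocr="Climate Change 2050",
        captions="a speaker in front of a melting glacier chart",
    ),
    VideoSignals(
        "talk_robots",
        asr="robots learn to grasp objects with reinforcement learning",
        ocr="Robotics Lab",
        captions="a robot arm picking up a cup",
    ),
]
index = build_index(videos)
results = search(index, "robot arm grasping a cup")
```

Swapping `embed` for a real sentence-embedding model would leave the offline/online split unchanged; only the vector type and the similarity routine would need to follow.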
