Enhancing Subsequent Video Retrieval via Vision-Language Models (VLMs)
Yicheng Duan, Xi Huang, Duo Chen
TL;DR
The paper addresses the challenge of adaptive, time-sensitive video retrieval in long-form content by fusing Vision-Language Model embeddings with a graph-based metadata layer and fast vector search. It introduces a hybrid pipeline that uses a VLM to generate contextual embeddings from frames and transcripts, stores the vectors in Pinecone, and manages frame metadata in Neo4j, enabling efficient, context-aware retrieval within and across videos. Key contributions include a prompt-engineering strategy for better embeddings, four embedding methods, and a retrieval workflow that assembles coherent video clips and summaries, validated on the Redhen TV Show dataset. The authors report scalable performance, effective summarization, and practical applicability to multilingual and cross-domain retrieval tasks in dynamic video environments.
Abstract
The rapid growth of video content demands efficient and precise retrieval systems. While vision-language models (VLMs) excel in representation learning, they often struggle with adaptive, time-sensitive video retrieval. This paper introduces a novel framework that combines vector similarity search with graph-based data structures. By leveraging VLM embeddings for initial retrieval and modeling contextual relationships among video segments, our approach enables adaptive query refinement and improves retrieval accuracy. Experiments demonstrate its precision, scalability, and robustness, offering an effective solution for interactive video retrieval in dynamic environments.
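To make the combination of vector similarity search and graph-based structure concrete, the following is a minimal sketch, not the authors' implementation. It uses plain in-memory Python stand-ins for the vector index and the graph layer, so no Pinecone or Neo4j calls appear. The frame records, the toy 3-dimensional embeddings, the "next frame" edge type, and the function names (`build_next_edges`, `vector_search`, `retrieve_clips`) are all assumptions made for illustration.

```python
from math import sqrt

# Hypothetical frame records: frame_id -> (video_id, timestamp_seconds, embedding).
# Real embeddings would come from a VLM; these toy vectors are illustrative only.
FRAMES = {
    "a0": ("video_a", 0.0, [1.0, 0.0, 0.0]),
    "a1": ("video_a", 1.0, [0.9, 0.1, 0.0]),
    "a2": ("video_a", 2.0, [0.0, 1.0, 0.0]),
    "a3": ("video_a", 3.0, [0.0, 0.5, 0.5]),
    "b0": ("video_b", 0.0, [0.0, 0.0, 1.0]),
    "b1": ("video_b", 1.0, [0.1, 0.9, 0.0]),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_next_edges(frames):
    """Link each frame to the next frame of the same video.

    This dictionary is an assumed stand-in for the graph metadata layer.
    """
    by_video = {}
    for frame_id, (video_id, timestamp, _) in frames.items():
        by_video.setdefault(video_id, []).append((timestamp, frame_id))
    edges = {}
    for items in by_video.values():
        items.sort()
        for (_, current), (_, following) in zip(items, items[1:]):
            edges[current] = following
    return edges


def vector_search(frames, query, top_k):
    """Brute-force nearest-neighbour search (stand-in for a vector database)."""
    scored = sorted(
        ((cosine(query, emb), frame_id) for frame_id, (_, _, emb) in frames.items()),
        reverse=True,
    )
    return [frame_id for _, frame_id in scored[:top_k]]


def retrieve_clips(frames, edges, query, top_k=2, hops=1):
    """Seed with vector search, then follow 'next' edges to add subsequent frames."""
    clips = []
    for seed in vector_search(frames, query, top_k):
        clip, current = [seed], seed
        for _ in range(hops):
            current = edges.get(current)
            if current is None:  # reached the last frame of the video
                break
            clip.append(current)
        clips.append((frames[seed][0], clip))
    return clips


clips = retrieve_clips(FRAMES, build_next_edges(FRAMES), [0.0, 1.0, 0.0])
```

In this sketch the similarity search only selects seed frames. The graph edges then supply the temporal context that turns each seed into a short clip, and because seeds are ranked across all frames, results can come from more than one video.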
