Enhancing Subsequent Video Retrieval via Vision-Language Models (VLMs)

Yicheng Duan, Xi Huang, Duo Chen

TL;DR

The paper addresses the challenge of adaptive, time-sensitive video retrieval in long-form content by fusing Vision-Language Model embeddings with a graph-based metadata layer and fast vector search. It introduces a hybrid pipeline that uses a VLM to generate contextual embeddings from frames and transcripts, stores vectors in Pinecone, and manages frame metadata in Neo4j, enabling efficient, context-aware retrieval and robust cross-video capabilities. Key contributions include a prompt-engineering strategy for better embeddings, four embedding methods, and a retrieval workflow that assembles coherent video clips and summaries, validated on the Redhen TV Show dataset. The approach demonstrates scalable performance, effective summarization, and practical applicability to multilingual and cross-domain retrieval tasks in dynamic video environments.

Abstract

The rapid growth of video content demands efficient and precise retrieval systems. While vision-language models (VLMs) excel in representation learning, they often struggle with adaptive, time-sensitive video retrieval. This paper introduces a novel framework that combines vector similarity search with graph-based data structures. By leveraging VLM embeddings for initial retrieval and modeling contextual relationships among video segments, our approach enables adaptive query refinement and improves retrieval accuracy. Experiments demonstrate its precision, scalability, and robustness, offering an effective solution for interactive video retrieval in dynamic environments.
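The two-stage idea described above (vector similarity for the initial hit, then a graph layer to pull in neighbouring segments) can be illustrated with a minimal sketch. This is not the authors' implementation: the in-memory `FRAMES` dictionary stands in for a vector store such as Pinecone, the `NEXT` mapping stands in for temporal edges in a graph database such as Neo4j, and the three-dimensional vectors, frame ids, 30-second spacing, and all function names are hypothetical values chosen only for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical toy index: frame id -> embedding plus metadata.
# In a real system the vectors would come from a VLM and live in a vector store.
FRAMES = {
    "v1_t0":  {"vec": [1.0, 0.0, 0.0], "video": "v1", "t": 0},
    "v1_t30": {"vec": [0.9, 0.1, 0.0], "video": "v1", "t": 30},
    "v1_t60": {"vec": [0.0, 1.0, 0.0], "video": "v1", "t": 60},
    "v2_t0":  {"vec": [0.0, 0.0, 1.0], "video": "v2", "t": 0},
}

# Hypothetical stand-in for the graph layer: temporal "next frame" edges.
NEXT = {"v1_t0": "v1_t30", "v1_t30": "v1_t60"}
PREV = {dst: src for src, dst in NEXT.items()}

def top_k(query_vec, k=1):
    """Stage 1: rank frames by cosine similarity to the query embedding."""
    ranked = sorted(
        FRAMES,
        key=lambda fid: cosine(query_vec, FRAMES[fid]["vec"]),
        reverse=True,
    )
    return ranked[:k]

def expand(frame_id, hops=1):
    """Stage 2: walk temporal edges to collect neighbouring frames in order."""
    ids = [frame_id]
    cur = frame_id
    for _ in range(hops):  # walk backwards in time
        cur = PREV.get(cur)
        if cur is None:
            break
        ids.insert(0, cur)
    cur = frame_id
    for _ in range(hops):  # walk forwards in time
        cur = NEXT.get(cur)
        if cur is None:
            break
        ids.append(cur)
    return ids

def retrieve_clip(query_vec, hops=1):
    """Combine both stages into a clip descriptor (video id, start, end)."""
    hit = top_k(query_vec, k=1)[0]
    frames = expand(hit, hops)
    times = [FRAMES[f]["t"] for f in frames]
    return {
        "video": FRAMES[hit]["video"],
        "start": min(times),
        "end": max(times),
        "frames": frames,
    }
```

Calling `retrieve_clip([0.0, 1.0, 0.0])` in this toy setup matches the frame at 60 seconds of `v1` and widens the result to the span from 30 to 60 seconds. Keeping the similarity search and the neighbourhood expansion separate is the point of the sketch: the first stage can be swapped for any approximate nearest-neighbour index without touching the logic that assembles a contiguous clip.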

Paper Structure

This paper contains 19 sections, 12 equations, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: Recall vs Extract Interval and Embedding Method
  • Figure 2: For the frame at 900 seconds into the video, the system responds as follows: "The image shows a news segment from CNN featuring a female reporter reporting on a tragic event. The reporter is discussing a shooting that occurred during a New Year's Eve celebration in Mobile, Alabama. The shooting resulted in one person being killed and at least nine others being injured. The reporter is also reporting on a separate story about a machete attack near Times Square in New York City, where a mayor has commended the response to the incident. The segment also includes footage of flooding in California caused by a drenching storm."