V-Agent: An Interactive Video Search System Using Vision-Language Models

SunYoung Park, Jong-Hyeon Lee, Youngjune Kim, Daegyu Sung, Younghyun Yu, Young-rok Cha, Jeongho Ju

TL;DR

V-Agent tackles multimodal video search by replacing text-only retrieval with a vision-language-based, cross-modal indexing and retrieval approach. It fine-tunes a VLM on a video-preference dataset and augments it with a retrieval vector derived from an image-text model to better align visual frames and ASR transcripts in a shared space. The system operates as three cooperating agents (routing, search, chat) with an LLM-based re-ranker and a two-stage indexing/search pipeline, achieving state-of-the-art zero-shot results on MultiVENT 2.0 and strong MSR-VTT performance. This work advances interactive, multimodal video information access with practical implications for real-world video search and Q&A tasks.

Abstract

We introduce V-Agent, a novel multi-agent platform designed for advanced video search and interactive user-system conversations. By fine-tuning a vision-language model (VLM) with a small video preference dataset and enhancing it with a retrieval vector from an image-text retrieval model, we overcome the limitations of traditional text-based retrieval systems in multimodal scenarios. The VLM-based retrieval model independently embeds video frames and audio transcriptions from an automatic speech recognition (ASR) module into a shared multimodal representation space, enabling V-Agent to interpret both visual and spoken content for context-aware video search. The system consists of three agents (a routing agent, a search agent, and a chat agent) that work collaboratively to address user intents by refining search outputs and communicating with users. The search agent utilizes the VLM-based retrieval model together with an additional re-ranking module to further enhance video retrieval quality. Our proposed framework demonstrates state-of-the-art zero-shot performance on the MultiVENT 2.0 benchmark, highlighting its potential for both academic research and real-world applications.
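
As a rough illustration of how frame and transcript embeddings in a shared space might be used for search, the sketch below scores each video against a query in both modalities and combines the two scores with a weighted average (the fusion style named in the Figure 2 caption). The function names, the toy 2-D embeddings, and the weight `w_visual` are assumptions made for this example, not values or code from the paper.

```python
import numpy as np

def cosine_scores(query, items):
    """Cosine similarity between one query vector and each row of `items`."""
    q = query / np.linalg.norm(query)
    m = items / np.linalg.norm(items, axis=1, keepdims=True)
    return m @ q

def fuse_scores(frame_scores, transcript_scores, w_visual=0.5):
    """Weighted average of per-video visual and transcript scores.

    `w_visual` is a hypothetical weight; the paper's setting is not given here.
    """
    return w_visual * frame_scores + (1.0 - w_visual) * transcript_scores

def top_k(scores, k):
    """Indices of the k highest-scoring videos, best first."""
    return np.argsort(-scores, kind="stable")[:k]

# Toy data: one query and three videos, each with a frame embedding and a
# transcript embedding living in the same 2-D space (made-up numbers).
query = np.array([1.0, 0.0])
frame_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
transcript_emb = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

fused = fuse_scores(cosine_scores(query, frame_emb),
                    cosine_scores(query, transcript_emb))
ranking = top_k(fused, k=2)
```

In this toy case the third video ranks first because it matches the query moderately well in both modalities, whereas the other two match in only one. In the described system, a ranked list of this kind would then be passed to the re-ranking module.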

Paper Structure

This paper contains 25 sections, 11 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Comparison of search results for the query "Find a person wearing a blue shirt giving a presentation". YouTube returns unrelated videos, whereas our system retrieves semantically relevant results by analyzing visual content.
  • Figure 2: Video Search Pipeline. The VLM-based retrieval model ($M_R$) embeds videos, audio, and video descriptions during both indexing and query processing. Upon receiving a query, the system retrieves the most relevant videos and transcriptions. The results are then combined using a weighted average score fusion method.
  • Figure 3: Video-Text Retrieval Model Overview. We fine-tune a VLM using the video preference dataset, and then add a retrieval vector $\tau$ to the fine-tuned model to enhance its overall vision-text retrieval capability. $\tau$ is computed as the parameter difference between an image-text retrieval model (GME) and the original Qwen2-VL-Instruct model.
  • Figure 4: V-Agent Pipeline. When a user submits a query intending to search for videos, the routing agent forwards the request to the search agent. The search agent employs the VLM-based video-text retrieval model to retrieve the most relevant video frames and transcribed texts. The results from each modality are then fused using a scoring fusion approach, and the top-$k$ video list is refined by an LLM re-ranking module. Once relevant results are retrieved, users can freely select videos of interest and ask questions about them. Otherwise, the routing agent directly calls the chat agent.
  • Figure 5: V-Agent Use Case Examples.
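
The retrieval vector in the Figure 3 caption can be read as simple parameter arithmetic: $\tau$ is the element-wise difference between the GME weights and the original Qwen2-VL-Instruct weights, and it is then added to the fine-tuned model's weights. The sketch below is a minimal illustration under that reading. It uses small NumPy arrays in place of real checkpoints, and the function names, the single toy parameter `"w"`, and the `scale` factor are hypothetical, not taken from the paper.

```python
import numpy as np

def retrieval_vector(theta_retriever, theta_base):
    """tau = retriever weights minus base weights, per named parameter."""
    return {name: theta_retriever[name] - theta_base[name] for name in theta_base}

def add_vector(theta_finetuned, tau, scale=1.0):
    """Add tau to the fine-tuned weights.

    `scale` is a hypothetical knob; the caption only says tau is added.
    """
    return {name: w + scale * tau[name] for name, w in theta_finetuned.items()}

# Toy stand-ins for three checkpoints that share the same parameter names.
theta_base = {"w": np.array([1.0, 2.0])}       # original Qwen2-VL-Instruct
theta_gme = {"w": np.array([1.5, 1.0])}        # image-text retrieval model (GME)
theta_finetuned = {"w": np.array([0.0, 3.0])}  # VLM tuned on the preference data

tau = retrieval_vector(theta_gme, theta_base)
theta_merged = add_vector(theta_finetuned, tau)
```

This arithmetic only makes sense when all three checkpoints share one architecture and parameter naming, which the caption implies by defining $\tau$ relative to the same Qwen2-VL-Instruct base.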