V-Agent: An Interactive Video Search System Using Vision-Language Models

SunYoung Park; Jong-Hyeon Lee; Youngjune Kim; Daegyu Sung; Younghyun Yu; Young-rok Cha; Jeongho Ju

TL;DR

V-Agent tackles multimodal video search by replacing text-only retrieval with a vision-language-based, cross-modal indexing and retrieval approach. It fine-tunes a VLM on a video-preference dataset and augments it with a retrieval vector derived from an image-text model to better align visual frames and ASR transcripts in a shared space. The system operates as three cooperating agents (routing, search, chat) with an LLM-based re-ranker and a two-stage indexing/search pipeline, achieving state-of-the-art zero-shot results on MultiVENT 2.0 and strong MSR-VTT performance. This work advances interactive, multimodal video information access with practical implications for real-world video search and Q&A tasks.

Abstract

We introduce V-Agent, a novel multi-agent platform designed for advanced video search and interactive user-system conversations. By fine-tuning a vision-language model (VLM) with a small video preference dataset and enhancing it with a retrieval vector from an image-text retrieval model, we overcome the limitations of traditional text-based retrieval systems in multimodal scenarios. The VLM-based retrieval model independently embeds video frames and audio transcriptions from an automatic speech recognition (ASR) module into a shared multimodal representation space, enabling V-Agent to interpret both visual and spoken content for context-aware video search. The system consists of three agents (a routing agent, a search agent, and a chat agent) that work collaboratively to address user intents by refining search outputs and communicating with users. The search agent utilizes the VLM-based retrieval model together with an additional re-ranking module to further enhance video retrieval quality. Our proposed framework demonstrates state-of-the-art zero-shot performance on the MultiVENT 2.0 benchmark, highlighting its potential for both academic research and real-world applications.
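
The retrieve-then-re-rank flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `VideoEntry`, `first_stage`, and `rerank` are hypothetical, the embeddings are plain lists standing in for VLM outputs, and the abstract does not say how frame and transcript similarities are combined, so taking the maximum of the two is an assumption made purely for illustration. The re-ranker is likewise represented by an arbitrary scoring callable rather than an actual model.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class VideoEntry:
    """One indexed video with independently computed embeddings (hypothetical)."""
    video_id: str
    frame_vec: Vector       # embedding of the video frames
    transcript_vec: Vector  # embedding of the ASR transcript


def first_stage(query_vec: Vector, index: List[VideoEntry], top_k: int) -> List[str]:
    """Rank videos by similarity to the query in the shared embedding space.

    Assumption: a video's score is the larger of its frame and transcript
    similarities. The source does not specify the fusion rule.
    """
    scored = [
        (max(cosine(query_vec, v.frame_vec), cosine(query_vec, v.transcript_vec)),
         v.video_id)
        for v in index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [video_id for _, video_id in scored[:top_k]]


def rerank(candidates: List[str], score_fn: Callable[[str], float]) -> List[str]:
    """Reorder first-stage candidates with a second scorer (a stand-in for a re-ranking model)."""
    return sorted(candidates, key=score_fn, reverse=True)
```

In this sketch the first stage only needs the stored vectors, so it stays cheap over a large index, while the costlier second scorer sees just the `top_k` candidates.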
