VDMA: Video Question Answering with Dynamically Generated Multi-Agents
Noriyuki Kugo, Tatsuya Ishibashi, Kosuke Ono, Yuji Sato
TL;DR
The paper tackles long-form video question answering on the EgoSchema dataset by introducing VDMA, a two-stage framework: Stage 1 dynamically generates expert agents, and Stage 2 performs multi-agent question answering coordinated by an organizer. By tailoring prompts to the video context and question, and by using specialized tools (Captioner and Video Analyzer), the approach achieves 70.7% accuracy on EgoSchema, outperforming a single-agent baseline. Ablation studies show gains from domain-specific expert generation and reveal both benefits and trade-offs as the number of frames changes. The work demonstrates that dynamically assembled expert ensembles can improve video question answering performance at the cost of higher computation, and it suggests future enhancements via agent debates and smarter tool usage.
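The two-stage flow described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`ExpertAgent`, `generate_experts`, `answer_with_expert`, `organize`, `vdma_sketch`) is hypothetical, the language-model calls are replaced by deterministic stubs, and the organizer is assumed to be a simple majority vote, which the summary does not specify.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical tool type: takes a question, returns textual evidence.
Tool = Callable[[str], str]


@dataclass
class ExpertAgent:
    role: str    # name of the expert persona
    prompt: str  # prompt tailored to the video context and question


def generate_experts(video_summary: str, question: str,
                     n_experts: int = 3) -> List[ExpertAgent]:
    """Stage 1 (stub): a real system would ask a language model to propose
    expert roles; here we fabricate placeholder roles deterministically."""
    return [
        ExpertAgent(
            role=f"expert_{i}",
            prompt=(f"You are expert_{i}. Context: {video_summary} "
                    f"Question: {question}"),
        )
        for i in range(n_experts)
    ]


def answer_with_expert(agent: ExpertAgent, question: str,
                       options: List[str], tools: Dict[str, Tool]) -> int:
    """Stage 2 (stub): gather evidence from every tool, then pick the option
    sharing the most words with it. A real agent would query a language
    model with agent.prompt instead of counting word overlap."""
    evidence = " ".join(tool(question) for tool in tools.values())
    evidence_words = set(evidence.lower().split())
    scores = [sum(word in evidence_words for word in option.lower().split())
              for option in options]
    return scores.index(max(scores))


def organize(votes: List[int]) -> int:
    """Organizer (assumption): majority vote over the experts' answers."""
    return Counter(votes).most_common(1)[0][0]


def vdma_sketch(video_summary: str, question: str, options: List[str],
                tools: Dict[str, Tool]) -> int:
    """Run both stages and return the index of the chosen option."""
    experts = generate_experts(video_summary, question)
    votes = [answer_with_expert(agent, question, options, tools)
             for agent in experts]
    return organize(votes)
```

In this sketch the Captioner and Video Analyzer would be passed in as entries of the `tools` dictionary, for example plain functions that return caption text for the frames relevant to the question.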
Abstract
This technical report describes our approach to the EgoSchema Challenge 2024, which asks participants to select the most appropriate response to a question about a given video clip. We propose Video Question Answering with Dynamically Generated Multi-Agents (VDMA). The method complements existing response generation systems by employing a multi-agent system with dynamically generated expert agents, with the aim of providing the most accurate and contextually appropriate responses. This report details the stages of our approach, the tools employed, and the results of our experiments.
