ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation
Tony Montes, Fernando Lozano
TL;DR
ViQAgent tackles zero-shot VideoQA by integrating a VideoLLM-based first-sight analyzer, open-vocabulary grounding via YOLO-World, and a Chain-of-Thought judging layer that cross-validates grounded context against object trajectories. The three-module pipeline (M1/M2/OG) produces temporally anchored grounding, yielding interpretable reasoning and refined answers without task-specific fine-tuning. It achieves state-of-the-art zero-shot performance on NExT-QA, iVQA, and ActivityNet-QA, with gains over baselines and robust handling of low-resolution and high-motion videos. The approach improves reliability and interpretability for multi-modal QA across diverse video domains, and the code is publicly available for replication.
Abstract
Recent advancements in Video Question Answering (VideoQA) have introduced LLM-based agents, modular frameworks, and procedural solutions, yielding promising results. These systems use dynamic agents and memory-based mechanisms to break down complex tasks and refine answers. However, significant room for improvement remains in two areas: tracking objects over time for grounding, and reasoning-based decision-making that better aligns object references with language model outputs, as newer models become more capable at both tasks. This work presents an LLM-brained agent for zero-shot VideoQA that combines a Chain-of-Thought framework with grounding reasoning and YOLO-World to enhance object tracking and alignment. This approach establishes a new state of the art in VideoQA and Video Understanding, with improved performance on the NExT-QA, iVQA, and ActivityNet-QA benchmarks. Our framework also enables cross-checking of grounding timeframes, which improves accuracy and supports verification and more reliable outputs across multiple video domains. The code is available at https://github.com/t-montes/viqagent.
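To make the idea of cross-checking grounding timeframes concrete, the following is a minimal sketch, not the authors' implementation. It assumes the language model claims a time interval for an object, and that an open-vocabulary detector has reported the frame indices where that object appears. The function names (`frames_to_intervals`, `temporal_iou`, `is_supported`), the gap tolerance, and the 0.5 overlap threshold are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def frames_to_intervals(frames: List[int], fps: float, max_gap: int = 1) -> List[Interval]:
    """Merge detected frame indices into time intervals (hypothetical helper).

    Frames separated by more than `max_gap` start a new interval.
    Each frame is treated as covering 1 / fps seconds.
    """
    if not frames:
        return []
    ordered = sorted(set(frames))
    intervals: List[Interval] = []
    start = prev = ordered[0]
    for f in ordered[1:]:
        if f - prev > max_gap:
            intervals.append((start / fps, (prev + 1) / fps))
            start = f
        prev = f
    intervals.append((start / fps, (prev + 1) / fps))
    return intervals


def temporal_iou(a: Interval, b: Interval) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def is_supported(claim: Interval, detected: List[Interval], threshold: float = 0.5) -> bool:
    """Return True if any detected interval overlaps the claim enough.

    The threshold is an assumed value, not one taken from the paper.
    """
    return any(temporal_iou(claim, d) >= threshold for d in detected)


# Example: an object detected in frames 0-3 and 10-11 of a 1 fps video.
detected = frames_to_intervals([0, 1, 2, 3, 10, 11], fps=1.0)
claim_ok = is_supported((0.0, 4.0), detected)
claim_bad = is_supported((5.0, 9.0), detected)
```

In this sketch, a claimed timeframe that no detected interval supports would be a candidate for rejection or re-examination by a judging step. How the actual system resolves such disagreements is not specified here.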
