ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation

Tony Montes, Fernando Lozano

TL;DR

ViQAgent tackles zero-shot VideoQA by integrating a VideoLLM-based first-sight analyzer, open-vocabulary grounding via YOLO-World, and a Chain-of-Thought judging layer that cross-validates grounded context against object trajectories. The three-module pipeline (M1/M2/OG) produces temporally anchored grounding, yielding interpretable reasoning and refined answers without task-specific fine-tuning. It achieves state-of-the-art zero-shot performance on NExT-QA, iVQA, and ActivityNet-QA, with notable gains over baselines and robust handling of low-resolution/high-motion videos. The approach offers practical gains in reliability and interpretability for multi-modal QA across diverse video domains, with public code available for replication.

Abstract

Recent advancements in Video Question Answering (VideoQA) have introduced LLM-based agents, modular frameworks, and procedural solutions, yielding promising results. These systems use dynamic agents and memory-based mechanisms to break down complex tasks and refine answers. However, significant room for improvement remains in tracking objects for grounding over time and in reasoning-based decision-making that better aligns object references with language model outputs, as newer models get better at both tasks. This work presents an LLM-brained agent for zero-shot VideoQA that combines a Chain-of-Thought framework with grounding reasoning alongside YOLO-World to enhance object tracking and alignment. This approach establishes a new state of the art in VideoQA and Video Understanding, showing enhanced performance on the NExT-QA, iVQA, and ActivityNet-QA benchmarks. Our framework also enables cross-checking of grounding timeframes, improving accuracy and providing valuable support for verification and increased output reliability across multiple video domains. The code is available at https://github.com/t-montes/viqagent.

Paper Structure

This paper contains 18 sections, 1 equation, 10 figures, 18 tables, 1 algorithm.

Figures (10)

  • Figure 1: An overview of our ViQAgent framework. Through three main modules, we propose an agentic solution for the Video Question-Answering (VideoQA) task. It takes advantage of the capabilities of the most advanced VideoLLMs in first-sight zero-shot reasoning, timeframe captioning, and target identification ($M_1$), and of the open-vocabulary capabilities of YOLO-World to ground the given targets/objects in specific parts of the video between $t_0$ and $t_f$ ($OG$). It ends with a Chain-of-Thought judgment and reasoning layer ($M_2$) that compares the grounded context with the grounded object detections to determine the confidence of the $M_1$ answer. In case of discrepancy, the CoT judge defines a set of clarification questions for specific timeframes, which go through the VideoLLM again for short-ended question answering. Finally, a reasoning layer takes these answers and the original question to produce a grounded and more accurate answer.
  • Figure 2: An outline of the inputs and outputs of the black-boxed ViQAgent framework modules, and of the intermediate representations that make it possible to track and understand the final selected answer. The $M_1$ inputs are the video and the question plus the answer options (namely, the prompt). The outputs are the open-vocabulary targets and the reasoning plus timeframe captions (namely, the Grounded Context). The $OG$ inputs are the targets and the video, and the output is the object detection timeline (namely, the Grounded Objects). Finally, $M_2$ first receives both grounding responses and the prompt; then, if there seem to be inconsistencies, it returns a doubtful timeframe and a set of clarification questions to ask the VideoLLM about that specific timeframe. The answers are then fed back in to produce the final answer.
  • Figure 3: A more detailed overview of the internal process of the $OG$ module. The process begins by extracting all frames from the input video $V$. For each frame $v_i$, the YOLO-World model detects specified target classes within the frame, using predetermined confidence and NMS thresholds ($\tau_c, \tau_{nms}$). After detection, these classes are tracked across all frames to establish the exact time intervals during which they are present. If a detected object is absent from subsequent frames for a specified duration $\tau_t$, it is assumed to have exited the scene, marking the end of its appearance.
  • Figure 4: VideoLLM Analyzer (VideoLLM$_1$): Given the full video and the "prompt" (question plus answer options, if available), the VideoLLM Analyzer submodule provides a first-sight response with a reasoning text explaining why that answer is correct.
  • Figure 5: VideoLLM Captioner (VideoLLM$_2$): Given the full video, but not the question (to avoid bias), the VideoLLM Captioner submodule provides a comprehensive set of event-separated timeframes with a description (i.e., a caption) of what is happening in every part of the video. This is the first grounding output, which is later compared against the YOLO-World object grounding.
  • ...and 5 more figures
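
The interval-building step described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes per-frame detections have already been produced (for example by YOLO-World with thresholds $\tau_c$ and $\tau_{nms}$) and reduced to a set of class labels per frame. The function name `detections_to_intervals`, the `fps` parameter, and the reading of $\tau_t$ as a duration in seconds are assumptions made for illustration.

```python
from collections import defaultdict


def detections_to_intervals(frame_detections, fps, tau_t):
    """Turn per-frame class labels into appearance intervals (in seconds).

    frame_detections: list of sets of labels, one set per frame (hypothetical input format).
    fps: frames per second of the video.
    tau_t: assumed to be the maximum absence, in seconds, before an object
           is considered to have left the scene.
    """
    max_gap = tau_t * fps      # tolerated absence, expressed in frames
    intervals = defaultdict(list)
    open_start = {}            # label -> first frame of the currently open interval
    last_seen = {}             # label -> most recent frame where the label was detected

    for i, labels in enumerate(frame_detections):
        for label in labels:
            # The label reappears after too long an absence: close the old interval.
            if label in open_start and i - last_seen[label] > max_gap:
                intervals[label].append((open_start[label] / fps, last_seen[label] / fps))
                del open_start[label]
            if label not in open_start:
                open_start[label] = i
            last_seen[label] = i

    # Close every interval that is still open at the end of the video.
    for label, start in open_start.items():
        intervals[label].append((start / fps, last_seen[label] / fps))
    return dict(intervals)


# Toy example: 7 frames at 1 fps, objects may vanish for up to 2 seconds.
frames = [{"dog"}, {"dog"}, set(), set(), set(), {"dog"}, {"dog", "ball"}]
timeline = detections_to_intervals(frames, fps=1, tau_t=2)
```

In this toy input the dog is missing for longer than the tolerated gap, so its timeline splits into two separate intervals, while a shorter absence would have been merged into one. The resulting per-label timeline plays the role of the "Grounded Objects" output that the captions above describe as being passed to $M_2$.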