Think, Then Verify: A Hypothesis-Verification Multi-Agent Framework for Long Video Understanding

Zheng Wang, Haoran Chen, Haoxuan Qin, Zhipeng Wei, Tianwen Qian, Cong Bai

TL;DR

VideoHV-Agent reformulates video question answering as a structured hypothesis-verification process, achieving state-of-the-art accuracy on three long-video understanding benchmarks while providing enhanced interpretability, improved logical soundness, and lower computational cost.

Abstract

Long video understanding is challenging due to dense visual redundancy, long-range temporal dependencies, and the tendency of chain-of-thought and retrieval-based agents to accumulate semantic drift and correlation-driven errors. We argue that long-video reasoning should begin not with reactive retrieval, but with deliberate task formulation: the model must first articulate what must be true in the video for each candidate answer to hold. This thinking-before-finding principle motivates VideoHV-Agent, a framework that reformulates video question answering as a structured hypothesis-verification process. Based on video summaries, a Thinker rewrites answer candidates into testable hypotheses, a Judge derives a discriminative clue specifying what evidence must be checked, a Verifier grounds and tests the clue using localized, fine-grained video content, and an Answer agent integrates validated evidence to produce the final answer. Experiments on three long-video understanding benchmarks show that VideoHV-Agent achieves state-of-the-art accuracy while providing enhanced interpretability, improved logical soundness, and lower computational cost. We make our code publicly available at: https://github.com/Haorane/VideoHV-Agent.
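
To make the data flow between the four roles concrete, here is a minimal Python sketch of the pipeline as the abstract describes it. It is not the authors' implementation (see the linked repository for that). Every name in it (`Evidence`, `answer_question`, the `toy_*` helpers, `max_loops`) is hypothetical, the agents are plain callables standing in for model calls, and the bounded retry loop is an assumption suggested only by the figure captions that mention a maximum number of loops.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Evidence:
    clue: str                        # what the Judge asked to be checked
    supported_option: Optional[str]  # option the evidence supports, None if inconclusive


def answer_question(
    summary: str,
    question: str,
    options: List[str],
    thinker: Callable[[str, str, str], str],
    judge: Callable[[str, Dict[str, str], List[Evidence]], str],
    verifier: Callable[[str, Dict[str, str]], Evidence],
    answerer: Callable[[str, Dict[str, str], List[Evidence]], str],
    max_loops: int = 3,  # assumed loop budget, not a value from the paper
) -> str:
    # Thinker: rewrite each candidate answer into a testable hypothesis.
    hypotheses = {opt: thinker(summary, question, opt) for opt in options}
    evidence: List[Evidence] = []
    for _ in range(max_loops):
        # Judge: derive one discriminative clue that separates the hypotheses.
        clue = judge(question, hypotheses, evidence)
        # Verifier: ground the clue in localized video content and test it.
        result = verifier(clue, hypotheses)
        evidence.append(result)
        if result.supported_option is not None:
            break  # decisive evidence found, stop early
    # Answer agent: integrate the collected evidence into a final answer.
    return answerer(question, hypotheses, evidence)


# Toy stand-ins for the agents (hypothetical; real agents would call a model).
def toy_thinker(summary: str, question: str, option: str) -> str:
    return f"If the answer is '{option}', the video must show: {option}."


def toy_judge(question: str, hypotheses: Dict[str, str], evidence: List[Evidence]) -> str:
    return f"check {len(evidence) + 1}: which hypothesis do the local captions support?"


def make_toy_verifier(local_captions: List[str]) -> Callable[[str, Dict[str, str]], Evidence]:
    def verify(clue: str, hypotheses: Dict[str, str]) -> Evidence:
        # Naive substring match against localized captions.
        for opt in hypotheses:
            if any(opt in caption for caption in local_captions):
                return Evidence(clue, opt)
        return Evidence(clue, None)
    return verify


def toy_answerer(question: str, hypotheses: Dict[str, str], evidence: List[Evidence]) -> str:
    for ev in reversed(evidence):
        if ev.supported_option is not None:
            return ev.supported_option
    return next(iter(hypotheses))  # no decisive evidence: fall back to the first option


captions = ["a person chops onions", "the person washes a pan"]
final = answer_question(
    "cooking video", "What does the person chop?", ["onions", "carrots"],
    toy_thinker, toy_judge, make_toy_verifier(captions), toy_answerer,
)
print(final)
```

The point of the sketch is the ordering: hypotheses are formulated before any fine-grained video content is consulted, and the Verifier only looks for evidence that the Judge's clue calls for.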

Paper Structure (29 sections, 6 figures, 7 tables, 1 algorithm)

Figures (6)

  • Figure 1: From correlation-based search to hypothesis verification: prior agents search and aggregate related clips, while VideoHV-Agent verifies testable clues with focused visual evidence.
  • Figure 2: Overview of the proposed VideoHV-Agent framework. The framework first (a) summarizes the long video's captions, then performs (b) two-step reasoning, in which a Thinker and a Judge agent rewrite the options into hypotheses and derive a discriminative clue, and a Verifier agent grounds this clue to collect visual evidence. Finally, an Answer agent integrates the evidence to (c) answer the question.
  • Figure 3: Ablation over the maximum number of loops.
  • Figure 4: Proportion of samples with different numbers of loops.
  • Figure 5: Qualitative study of event understanding in long videos. VideoHV-Agent uses hypothesis-verification to locate decisive evidence, highlighting its ability to search purposefully and ground conclusions in explicit visual proof.
  • ...and 1 more figure