VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT

Zhuo Zhi, Qiangqiang Wu, Minghe Shen, Wenbo Li, Yinchuan Li, Kun Shao, Kaiwen Zhou

TL;DR

VideoAgent2 addresses long-form video understanding by integrating an uncertainty-aware, plan-adjust chain-of-thought (CoT) process into an LLM-based agent framework. It introduces a four-phase pipeline (general context acquisition, answer assessment, information retrieval plan creation/adjustment, and information retrieval) in which both the LLM and the external tools contribute uncertainty estimates that guide reasoning and retrieval. The key contributions are an uncertainty-guided CoT mechanism that requires no extra parameters, a segment-based captioning approach that preserves temporal information, and a modular tool design that improves robustness to noisy tool outputs. Evaluated on EgoSchema, NExT-QA, and IntentQA, VideoAgent2 achieves state-of-the-art zero-shot performance and demonstrates strong robustness and efficiency on long-form video QA tasks.
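The four-phase loop can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Evidence`, `Assessment`, `run_agent`, the callables passed in, the 0-to-1 uncertainty scale, and the `threshold` stopping rule are all assumptions made for illustration. Only the phase order and the round budget (the paper reports $T = 5$, which caps tool calls at $T-1$) come from the source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Evidence:
    """A piece of collected information with a heuristic uncertainty score."""
    text: str
    uncertainty: float  # assumed scale: 0.0 = reliable, 1.0 = unreliable


@dataclass
class Assessment:
    """The LLM's current answer (if any) and its self-reported uncertainty."""
    answer: Optional[str]
    uncertainty: float


def run_agent(
    question: str,
    acquire_context: Callable[[str], Evidence],
    assess: Callable[[str, List[Evidence]], Assessment],
    plan: Callable[[str, List[Evidence]], Tuple[str, str]],
    tools: Dict[str, Callable[[str], Evidence]],
    max_rounds: int = 5,       # T in the paper's notation
    threshold: float = 0.3,    # hypothetical confidence cut-off
) -> Tuple[Optional[str], int]:
    """Return (answer, number_of_tool_calls)."""
    # Phase 1: general context acquisition.
    memory: List[Evidence] = [acquire_context(question)]
    tool_calls = 0
    result = Assessment(answer=None, uncertainty=1.0)

    for round_idx in range(max_rounds):
        # Phase 2: answer assessment over everything collected so far.
        result = assess(question, memory)
        if result.answer is not None and result.uncertainty <= threshold:
            return result.answer, tool_calls
        if round_idx == max_rounds - 1:
            break  # budget exhausted: at most max_rounds - 1 tool calls

        # Phase 3: create or adjust the retrieval plan. The planner sees
        # the uncertainty attached to each piece of evidence in memory.
        tool_name, query = plan(question, memory)

        # Phase 4: information retrieval with the chosen tool.
        memory.append(tools[tool_name](query))
        tool_calls += 1

    # Fall back to the last (still uncertain) answer.
    return result.answer, tool_calls
```

In this sketch, a question that the general context already answers confidently returns after zero tool calls, and a question that never reaches the threshold stops after `max_rounds - 1` calls with the last candidate answer.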

Abstract

Long video understanding has emerged as an increasingly important yet challenging task in computer vision. Agent-based approaches are gaining popularity for processing long videos, as they can handle extended sequences and integrate various tools to capture fine-grained information. However, existing methods still face several challenges: (1) they often rely solely on the reasoning ability of large language models (LLMs) without dedicated mechanisms to enhance reasoning in long video scenarios; and (2) they remain vulnerable to errors or noise from external tools. To address these issues, we propose a specialized chain-of-thought (CoT) process tailored for long video analysis. Our proposed CoT with plan-adjust mode enables the LLM to incrementally plan and adapt its information-gathering strategy. We further incorporate heuristic uncertainty estimation of both the LLM and external tools to guide the CoT process. This allows the LLM to assess the reliability of newly collected information, refine its collection strategy, and make more robust decisions when synthesizing final answers. Empirical experiments show that our uncertainty-aware CoT effectively mitigates noise from external tools, leading to more reliable outputs. We implement our approach in a system called VideoAgent2, which also includes additional modules such as general context acquisition and specialized tool design. Evaluation on three dedicated long video benchmarks (and their subsets) demonstrates that VideoAgent2 outperforms the previous state-of-the-art agent-based method, VideoAgent, by an average of 13.1% and achieves leading performance among all zero-shot approaches.

Paper Structure

This paper contains 22 sections, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Performance of the previous SOTA agent-based method VideoAgent [wang2024videoagent], the previous SOTA zero-shot methods [zhang2024hcqa, wang2023lifelongmemory, zhang2023simple, wang2024videoagent, ayyubi2025enter], and our proposed VideoAgent2 on all evaluation datasets. The metric is accuracy.
  • Figure 2: Overview of VideoAgent2. VideoAgent2 answers a question $Q$ about a video $V$ through a pipeline consisting of four phases: general context acquisition, answer assessment, information retrieval plan creation/adjustment, and information retrieval. Details of each phase are introduced in Section \ref{sec:method}.
  • Figure 3: Case study of VideoAgent2. The video and associated question are presented in Fig. \ref{fig:main_fig}. Both the popular MLLM LLaVA-OneVision and the SOTA agent baseline, VideoAgent, fail to provide the correct answer. In contrast, our proposed VideoAgent2 correctly answers the question through three tool calls and four answer assessments. VideoAgent2 leverages the information and uncertainty provided by the tools, enabling the LLM to continuously adjust its information retrieval plan and make more reliable decisions when synthesizing the final answer.
  • Figure 4: Proportion of samples by number of tool calls for each dataset. A tool call count of 0 means that, for this sample, VideoAgent2 obtained enough information from the general context information $B_1$ to answer the question without any further information retrieval. The maximum number of tool calls is $T-1$, where $T$ is set to 5 in our experiments.
  • Figure 5: The average number of tool calls for each question type and the average number of calls to each tool on the NExT-QA validation set.
  • ...and 1 more figures
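
The statistic plotted in Figure 4 can be computed as in the following sketch. The function name `tool_call_proportions` and its input format (one tool-call count per sample) are assumptions for illustration, not the authors' evaluation code; only the bound of $T-1$ calls with $T = 5$ is taken from the caption.

```python
from collections import Counter
from typing import Dict, Sequence


def tool_call_proportions(
    calls_per_sample: Sequence[int], max_rounds: int = 5
) -> Dict[int, float]:
    """Share of samples that used 0, 1, ..., max_rounds - 1 tool calls."""
    max_calls = max_rounds - 1  # T - 1 in the caption's notation
    if not calls_per_sample:
        raise ValueError("need at least one sample")
    if any(c < 0 or c > max_calls for c in calls_per_sample):
        raise ValueError(f"tool-call counts must lie in [0, {max_calls}]")

    counts = Counter(calls_per_sample)
    total = len(calls_per_sample)
    # Report every bucket, including empty ones, so datasets are comparable.
    return {k: counts[k] / total for k in range(max_calls + 1)}
```

Calling it on a toy list such as `[0, 0, 1, 4]` (made-up counts, not results from the paper) yields one proportion per bucket from 0 to 4.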