Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding

Hong Gao, Yiming Bao, Xuezhen Tu, Yutong Xu, Yue Jin, Yiyang Mu, Bin Zhong, Linan Yue, Min-Ling Zhang

TL;DR

The paper tackles the complexity of long-horizon video understanding by proposing Agentic Video Intelligence (AVI), a training-free framework that emulates human-like cognition through a three-phase Retrieve-Perceive-Review reasoning loop and a structured, multi-granularity knowledge base. AVI leverages an open-source model ensemble and a rich environment consisting of clip-level captions, embeddings, and an entity-centric temporal knowledge graph to enable interpretable, tool-assisted reasoning without reliance on proprietary APIs or RL training. Empirical results on LVBench, VideoMME-Long, LongVideoBench, and Charades-STA show competitive performance and strong temporal grounding, alongside superior interpretability and cost-efficiency compared to RL-trained or monolithic VLM baselines. The work demonstrates that careful architectural design and modular tool-enabled reasoning can achieve robust video understanding while improving reproducibility and accessibility for the research community.

Abstract

Video understanding requires not only visual recognition but also complex reasoning. While Vision-Language Models (VLMs) demonstrate impressive capabilities, they typically process videos in a largely single-pass manner, with limited support for revisiting evidence and refining answers iteratively. Recently emerging agent-based methods enable long-horizon reasoning, but they either depend heavily on expensive proprietary models or require extensive agentic RL training. To overcome these limitations, we propose Agentic Video Intelligence (AVI), a flexible and training-free framework that mirrors human video comprehension through system-level design and optimization. AVI introduces three key innovations: (1) a human-inspired three-phase reasoning process (Retrieve-Perceive-Review) that ensures both sufficient global exploration and focused local analysis, (2) a structured video knowledge base organized through entity graphs, along with multi-granularity integrated tools, constituting the agent's interaction environment, and (3) an open-source model ensemble combining reasoning LLMs with lightweight base CV models and a VLM, eliminating dependence on proprietary APIs or RL training. Experiments on LVBench, VideoMME-Long, LongVideoBench, and Charades-STA demonstrate that AVI achieves competitive performance while offering superior interpretability.

Paper Structure

This paper contains 38 sections, 13 equations, 12 figures, 5 tables, 1 algorithm.

Figures (12)

  • Figure 1: Comparison of video understanding paradigms. (a) VLM-based methods process video frames in a single-pass manner. (b) Current agent-based methods use ReAct loops to interact with the environment, but require LM APIs or RL training. (c) Our AVI imitates human intelligence with three-phase reasoning, builds a structured environment, and employs open-source models.
  • Figure 2: The AVI framework architecture. The structured video database contains clip captions, embeddings, entity graphs, and raw frames. The agentic core implements three-phase reasoning (Retrieve-Perceive-Review) with internal Think-Action-Observation nodes and two tool suites: (1) Retrieve Tools for global exploration and segment location; (2) Perceive Tools for local visual analysis, powered by base CV models and an open-source VLM. AVI iteratively gathers evidence (Retrieve and Perceive phases) and then decides whether to output an answer or return to an earlier phase for refinement (Review phase). This design enables interpretable, training-free video understanding.
  • Figure 3: Qualitative examples of AVI's reasoning traces on complex video understanding tasks. The trace shows how AVI combines text-based retrieval with visual analysis to identify both the temporal segment and spatial semantics. Each phase is clearly delineated, demonstrating the interpretability of our approach.
  • Figure 4: Analysis on the (a) average time per phase and (b) failure case distribution on LVBench.
  • Figure 5: The whole system prompt of AVI.
  • ...and 7 more figures
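
The Retrieve-Perceive-Review loop described in the Figure 2 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Clip` and `Trace` classes, the word-overlap retrieval, and the substring-based `perceive` stub are all hypothetical stand-ins chosen so the example runs without any models. In the real system, retrieval would presumably use captions, embeddings, and the entity graph, and perception would call CV models or a VLM on raw frames.

```python
from dataclasses import dataclass, field


@dataclass
class Clip:
    """One entry of a toy video knowledge base (hypothetical schema)."""
    start: float   # clip start time in seconds
    end: float     # clip end time in seconds
    caption: str   # clip-level caption, standing in for captions/embeddings


@dataclass
class Trace:
    """Records every step so the reasoning stays inspectable."""
    steps: list = field(default_factory=list)

    def log(self, phase: str, detail: str) -> None:
        self.steps.append((phase, detail))


def retrieve(clips, query, exclude):
    """Retrieve phase (toy): pick the unseen clip with the most query-word overlap."""
    words = set(query.lower().split())
    scored = [
        (len(words & set(c.caption.lower().split())), i)
        for i, c in enumerate(clips)
        if i not in exclude
    ]
    scored = [s for s in scored if s[0] > 0]
    if not scored:
        return None
    # Highest overlap wins; ties go to the earliest clip.
    _, best_index = max(scored, key=lambda s: (s[0], -s[1]))
    return best_index


def perceive(clip, target):
    """Perceive phase (toy): stand-in for a VLM/CV call on the clip's frames."""
    return target.lower() in clip.caption.lower()


def answer(clips, query, target, max_rounds=3):
    """Run Retrieve -> Perceive -> Review until evidence is found or the budget ends."""
    trace, seen = Trace(), set()
    for _ in range(max_rounds):
        idx = retrieve(clips, query, seen)
        if idx is None:
            trace.log("review", "no candidate clips left")
            break
        seen.add(idx)
        clip = clips[idx]
        trace.log("retrieve", f"clip {idx} [{clip.start}-{clip.end}s]")
        found = perceive(clip, target)
        trace.log("perceive", f"target found: {found}")
        if found:  # Review phase: evidence is sufficient, so stop and answer
            trace.log("review", "sufficient evidence")
            return (clip.start, clip.end), trace
        trace.log("review", "insufficient evidence, retrieving again")
    return None, trace
```

Calling `answer(clips, "man", "red cup")` on a small list of `Clip` objects returns the time span of the first retrieved clip whose caption contains the target, together with a `Trace` whose `steps` list shows each phase in order. The explicit trace is the point of the sketch: it mirrors, in a very reduced form, the interpretability that the caption attributes to clearly delineated phases.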