Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding

Ziyang Wang, Honglu Zhou, Shijie Wang, Junnan Li, Caiming Xiong, Silvio Savarese, Mohit Bansal, Michael S. Ryoo, Juan Carlos Niebles

TL;DR

This work reframes long video understanding as an active, evidence-seeking task rather than passive captioning, introducing Active Video Perception (AVP). AVP uses a plan-observe-reflect loop with Planner, Observer, and Reflector modules to selectively observe query-relevant video regions and maintain a structured, time-stamped evidence record. Across five LVU benchmarks, AVP achieves state-of-the-art results among agentic frameworks and general multimodal LLMs, while requiring only 18.4% of the inference time and 12.4% of the input tokens of the best agentic method. Ablation analyses show that the planner and reflector are crucial for performance and that the approach benefits from stronger backbones, suggesting practical potential for efficient, grounded long-horizon video reasoning. Future work points to embodied or real-time settings where active perception must operate under physical constraints.

Abstract

Long video understanding (LVU) is challenging because answering real-world queries often depends on sparse, temporally dispersed cues buried in hours of mostly redundant and irrelevant content. While agentic pipelines improve video reasoning capabilities, prevailing frameworks rely on a query-agnostic captioner to perceive video information, which wastes computation on irrelevant content and blurs fine-grained temporal and spatial information. Motivated by active perception theory, we argue that LVU agents should actively decide what, when, and where to observe, and continuously assess whether the current observation is sufficient to answer the query. We present Active Video Perception (AVP), an evidence-seeking framework that treats the video as an interactive environment and acquires compact, query-relevant evidence directly from pixels. Concretely, AVP runs an iterative plan-observe-reflect process with MLLM agents. In each round, a planner proposes targeted video interactions, an observer executes them to extract time-stamped evidence, and a reflector evaluates the sufficiency of the evidence for the query, either halting with an answer or triggering further observation. Across five LVU benchmarks, AVP achieves the highest performance with significant improvements. Notably, AVP outperforms the best agentic method by 5.7% in average accuracy while requiring only 18.4% of the inference time and 12.4% of the input tokens.
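
The plan-observe-reflect process described above can be sketched as a simple control loop. The following is a minimal illustration, not the authors' implementation: the `Plan`, `Evidence`, and `Verdict` types, the `active_video_perception` function, the `max_rounds` cap, and the three toy agents are all hypothetical names chosen for this example, and the toy agents are plain Python stand-ins for what would be MLLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Plan:
    """A targeted video interaction (hypothetical schema)."""
    start_s: float
    end_s: float
    fps: float
    resolution: str


@dataclass
class Evidence:
    """One time-stamped piece of query-relevant evidence (hypothetical schema)."""
    start_s: float
    end_s: float
    description: str


@dataclass
class Verdict:
    """Reflector output: is the evidence sufficient, and if so, the answer."""
    sufficient: bool
    answer: Optional[str] = None


def active_video_perception(
    query: str,
    planner: Callable[[str, List[Evidence]], Plan],
    observer: Callable[[Plan, str], List[Evidence]],
    reflector: Callable[[str, List[Evidence]], Verdict],
    max_rounds: int = 3,  # assumed cap; not taken from the paper
) -> Tuple[Optional[str], List[Evidence], int]:
    """Run plan -> observe -> reflect until the reflector halts or rounds run out."""
    evidence: List[Evidence] = []
    for round_idx in range(1, max_rounds + 1):
        plan = planner(query, evidence)          # decide what/where/how to look
        evidence.extend(observer(plan, query))   # extract time-stamped evidence
        verdict = reflector(query, evidence)     # check sufficiency
        if verdict.sufficient:
            return verdict.answer, evidence, round_idx
    return None, evidence, max_rounds


# Toy agents for illustration only; a 600 s video length is assumed.
def toy_planner(query: str, evidence: List[Evidence]) -> Plan:
    if not evidence:
        return Plan(0.0, 600.0, fps=0.5, resolution="low")      # coarse scan
    last = evidence[-1]
    return Plan(last.start_s, last.end_s, fps=2.0, resolution="medium")


def toy_observer(plan: Plan, query: str) -> List[Evidence]:
    if plan.fps < 1.0:
        return [Evidence(60.0, 70.0, "candidate interval, object unclear")]
    return [Evidence(60.0, 70.0, "object visible in upper-left background")]


def toy_reflector(query: str, evidence: List[Evidence]) -> Verdict:
    if any("visible" in e.description for e in evidence):
        return Verdict(True, "D")
    return Verdict(False)


answer, record, rounds = active_video_perception(
    "When does the monument first appear?", toy_planner, toy_observer, toy_reflector
)
```

With these toy agents the loop halts in the second round: the first coarse pass only localizes a candidate interval, and the targeted second pass yields evidence the reflector accepts. The evidence list accumulates across rounds, so the reflector always sees the full record rather than only the latest observation.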

Paper Structure

This paper contains 41 sections, 6 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Motivation of Active Video Perception. Prior methods follow a passive perception paradigm that leverages a query-agnostic captioner to perceive the video information, leading to low efficiency and imprecise visual grounding. Instead, we actively perceive query-relevant content by treating the long video as an interactive environment to be explored in a goal-directed manner.
  • Figure 2: Framework of Active Video Perception (AVP). AVP operates through an iterative plan-observe-reflect process with MLLM agents. In each round, the planner decides what/where/how to interact with the video, the observer extracts structured query-related evidence by executing the plan, and the reflector evaluates the extracted evidence to decide whether an additional round is needed.
  • Figure 3: Qualitative example of AVP. Given a multiple-choice query about the Tombstone monument's first on-screen appearance, Round 1 performs a coarse scan of the entire video (0.5 FPS, low resolution) and localizes a candidate interval [1:00, 1:10], but the Reflector judges the evidence insufficient. Round 2 re-plans a targeted pass over this window (2 FPS, medium resolution), enabling the Observer to localize the monument in the upper-left background and the Reflector to confidently select the correct answer (option D) and halt.
  • Figure 4: Qualitative example of multi-round active perception in AVP (MINERVA sample). Given the query, "After adding up all the millimeter totals on the sheet of paper illustrated at 09:58, and then adding the average length of Louisiana Pine Snake hatchlings according to the video, how many total millimeters are there?", AVP first plans to focus on the local timestamped frame at 09:58 and extracts the seven millimeter totals from the handwritten measurement sheet (Round 1). The reflector correctly judges that this evidence is insufficient because the average hatchling length is still unknown, and triggers a second round. In Round 2, the planner re-directs the observer to uniformly scan the full video at low FPS, locating a narrated segment that states hatchlings "usually range from 4 to 5 feet in length." By fusing the previous numeric evidence with this newly discovered range, the reflector computes the total millimeter interval and selects the correct option.
  • Figure 5: Failure Case of AVP (MINERVA sample). Given a long broadcast basketball video, AVP must answer: “How many three-pointers are made before the second clip of Hawaii versus UCSB?” The planner chooses to scan the entire video at 0.5 FPS with low spatial resolution, the observer summarizes the retrieved segments into a structured evidence list, and the reflector produces a confident answer of two. However, the ground-truth reasoning (yellow box) shows that a three-pointer at 00:20 is missed, so the correct count is three. Although the internal reasoning over the collected evidence is coherent, the initial coarse observation policy fails to capture a short, local event, leading to an overconfident but incorrect prediction.
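
To make the coarse-to-fine observation settings in the Figure 3 caption concrete, the sketch below computes which timestamps a plan would sample for a given window and frame rate. It is only an illustration under stated assumptions: the `sample_timestamps` helper is a hypothetical name, and uniform sampling that includes both window endpoints is an assumption, not a detail confirmed by the source.

```python
from typing import List


def sample_timestamps(start_s: float, end_s: float, fps: float) -> List[float]:
    """Uniformly spaced timestamps in [start_s, end_s] at the given frame rate.

    Hypothetical helper: assumes both endpoints of the window are sampled.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    if end_s < start_s:
        raise ValueError("end_s must not precede start_s")
    step = 1.0 / fps
    # Small epsilon guards against floating-point round-down of the frame count.
    n_frames = int((end_s - start_s) * fps + 1e-9) + 1
    return [start_s + i * step for i in range(n_frames)]


# The candidate window [1:00, 1:10] from the Figure 3 caption, in seconds.
coarse = sample_timestamps(60.0, 70.0, fps=0.5)  # one frame every 2 s
fine = sample_timestamps(60.0, 70.0, fps=2.0)    # one frame every 0.5 s
```

Under these assumptions the 10-second window contributes 6 frames at 0.5 FPS and 21 frames at 2 FPS. The coarse pass samples one frame every two seconds, so an event shorter than that spacing can fall entirely between samples, which is consistent with the kind of miss described in the Figure 5 caption.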