EEA: Exploration-Exploitation Agent for Long Video Understanding

Te Yang, Xiangyu Zhu, Bo Wang, Quan Chen, Peng Jiang, Zhen Lei

TL;DR

The paper tackles long-form video understanding by introducing EEA, a framework that harmonizes exploration and exploitation through semantic guidance. It integrates Dynamic Query Management for evolving semantic priors, Uncertainty-Aware Reward Fusion to stabilize long-horizon evaluation, and Semantic Guided Tree Search to focus expansion around informative anchors while maintaining coverage. Empirical results on EgoSchema and LVBench show state-of-the-art accuracy with substantially fewer observed frames, validating both effectiveness and efficiency. The work advances scalable, semantically informed video reasoning and provides a recipe for model-agnostic deployment with strong practical impact.

Abstract

Long-form video understanding requires efficient navigation of extensive visual data to pinpoint sparse yet critical information. Current approaches to long-form video understanding either suffer from severe computational overhead due to dense preprocessing, or fail to effectively balance exploration and exploitation, resulting in incomplete information coverage and inefficiency. In this work, we introduce EEA, a novel video agent framework that achieves an exploration-exploitation balance through semantic guidance with a hierarchical tree search process. EEA autonomously discovers and dynamically updates task-relevant semantic queries, and collects video frames closely matched to these queries as semantic anchors. During the tree search process, instead of uniform expansion, EEA preferentially explores semantically relevant frames while ensuring sufficient coverage within unknown segments. Moreover, EEA adaptively combines intrinsic rewards from vision-language models (VLMs) with semantic priors by explicitly modeling uncertainty to achieve stable and precise evaluation of video segments. Experiments across various long-video benchmarks validate the superior performance and computational efficiency of our proposed method.
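The abstract describes combining the VLM's intrinsic reward with a semantic prior by modeling uncertainty, but this page does not give the formula. The following is a minimal sketch of one plausible reading, not the paper's actual method: it assumes uncertainty is measured as the normalized entropy of the intrinsic reward distribution over candidate segments, and that the fused reward is a convex combination weighted by the resulting confidence. The names `fuse_rewards`, `softmax`, and `normalized_entropy` are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy of a distribution, scaled into [0, 1]."""
    n = len(probs)
    if n < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return min(1.0, max(0.0, h / math.log(n)))


def fuse_rewards(intrinsic_scores, query_scores):
    """Blend intrinsic (VLM) rewards with query-based scores per segment.

    Assumption: the flatter (higher-entropy) the intrinsic distribution,
    the less it is trusted and the more the query score dominates.
    """
    p_intrinsic = softmax(intrinsic_scores)
    p_query = softmax(query_scores)
    confidence = 1.0 - normalized_entropy(p_intrinsic)
    return [
        confidence * pi + (1.0 - confidence) * pq
        for pi, pq in zip(p_intrinsic, p_query)
    ]


if __name__ == "__main__":
    # Intrinsic reward cannot tell the segments apart; the query score can.
    print(fuse_rewards([1.0, 1.0, 1.0], [0.0, 2.0, 0.0]))
```

Under these assumptions, an uninformative intrinsic reward (all segments scored equally) hands the decision entirely to the query score, while a sharply peaked intrinsic reward is used almost unchanged.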

Paper Structure

This paper contains 25 sections, 4 equations, 8 figures, 11 tables, 1 algorithm.

Figures (8)

  • Figure 1: Accuracy versus Computational Cost and Frame Utilization. Our agent achieves higher accuracy with improved efficiency, requiring fewer observed frames and comparable or lower GPU cost on EgoSchema, under both GPT-4o and Qwen2-VL-72B settings.
  • Figure 2: Comparison of EEA with prior works. Difference-1: Prior methods use no semantic priors. EEA performs dynamic query discovery and query update during exploration, enabling progressively refined guidance over long videos. Difference-2: EEA performs semantic-guided expansion, leveraging semantic queries and anchors to focus on relevant frames. In contrast, previous methods rely on blind uniform sampling, risking the omission of critical events. Difference-3: Previous methods base exploration decisions solely on a potentially noisy intrinsic reward. EEA achieves more stable evaluation by fusing this reward with a robust query-based score.
  • Figure 3: Pipeline of the EEA Framework. The agent first derives semantic queries from the input query and identifies their corresponding semantic anchors in the video. Guided by these anchors, it performs semantic-guided expansion to generate candidate nodes. It then evaluates each node with a fused reward obtained through uncertainty-aware reward fusion, which combines the intrinsic reward and the query score. Finally, the agent decides its next action based on the information obtained and updates the semantic queries, anchors, and memory buffer.
  • Figure 4: Comparison of Entropy Distribution between Intrinsic Reward and Fused Reward on LVBench. The y-axis represents the value of the entropy, and the x-axis represents the probability density. The red and blue curves represent fitted normal distributions.
  • Figure 5: Exploration Trajectory Example. Compared with VCA, our agent can rapidly pinpoint critical information in each segment with the guidance of semantic queries and anchors. Moreover, even when the reward model becomes unreliable, the agent can still produce a discriminative fused reward by integrating the query score.
  • ...and 3 more figures
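
The captions for Figures 2 and 3 describe semantic-guided expansion: rather than sampling a segment uniformly, the agent favors frames near semantic anchors while still covering the rest of the segment. The sketch below illustrates that idea under stated assumptions and is not the paper's algorithm: the function `expand_segment`, the `anchor_share` parameter, and the midpoint grid used for coverage are all hypothetical choices made for illustration.

```python
def expand_segment(start, end, anchors, budget, anchor_share=0.5):
    """Pick up to `budget` frame indices from the segment [start, end).

    anchors: dict mapping frame index -> query similarity score
             (assumed to be produced by some retrieval step).
    anchor_share: assumed fraction of the budget reserved for anchors.
    """
    if budget <= 0 or end <= start:
        return []

    # Exploitation: take the best-scoring anchors inside this segment.
    in_segment = sorted(
        (f for f in anchors if start <= f < end),
        key=lambda f: anchors[f],
        reverse=True,
    )
    n_anchor = min(len(in_segment), int(budget * anchor_share))
    chosen = set(in_segment[:n_anchor])

    # Exploration: spend the remaining budget on an evenly spaced grid
    # so that regions without anchors are still observed.
    n_cover = budget - n_anchor
    if n_cover > 0:
        step = (end - start) / n_cover
        for i in range(n_cover):
            chosen.add(start + int((i + 0.5) * step))

    # A grid frame may coincide with an anchor, so fewer than `budget`
    # distinct frames can be returned.
    return sorted(chosen)


if __name__ == "__main__":
    # The anchor at frame 200 lies outside the segment and is ignored.
    print(expand_segment(0, 100, {10: 0.9, 55: 0.8, 200: 0.99}, budget=4))
```

With no anchors in the segment, this sketch falls back to plain uniform sampling, which corresponds to the baseline behavior the Figure 2 caption attributes to earlier methods.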