LensWalk: Agentic Video Understanding by Planning How You See in Videos

Keliang Li, Yansong Li, Hongze Shen, Mengdi Liu, Hong Chang, Shiguang Shan

Abstract

The dense, temporal nature of video presents a profound challenge for automated analysis. Despite the use of powerful Vision-Language Models, prevailing methods for video understanding are limited by an inherent disconnect between reasoning and perception: they rely on static, pre-processed information and cannot actively seek raw evidence from the video as their understanding evolves. To address this, we introduce LensWalk, a flexible agentic framework that empowers a Large Language Model reasoner to actively control its own visual observation. LensWalk establishes a tight reason-plan-observe loop in which the agent dynamically specifies, at each step, the temporal scope and sampling density of the video it observes. Using a suite of versatile, Vision-Language Model-based tools parameterized by these specifications, the agent can perform broad scans for cues, focus on specific segments for fact extraction, and stitch together evidence from multiple moments for holistic verification. This design allows for progressive, on-demand evidence gathering that directly serves the agent's evolving chain of thought. Without requiring any model fine-tuning, LensWalk delivers substantial, plug-and-play performance gains across multiple model recipes, boosting their accuracy by over 5% on challenging long-video benchmarks such as LVBench and Video-MME. Our analysis reveals that enabling an agent to control how it sees is key to unlocking more accurate, robust, and interpretable video reasoning.
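To make the reason-plan-observe loop concrete, the sketch below shows one hypothetical way such a loop could be wired up. The tool names, the plan format, and the `llm`, `video`, and `vlm_tools` interfaces are illustrative assumptions for exposition, not the paper's actual API.

```python
# Hypothetical sketch of a reason-plan-observe loop in the spirit of LensWalk.
# Tool names, signatures, and the plan format are illustrative assumptions;
# they are not taken from the paper's implementation.
from dataclasses import dataclass

@dataclass
class Observation:
    tool: str          # "scan_search", "segment_focus", or "stitch_verify"
    start_s: float     # temporal scope: segment start (seconds)
    end_s: float       # temporal scope: segment end (seconds)
    fps: float         # sampling density: frames sampled per second

def run_lenswalk(question, video, llm, vlm_tools, max_steps=10):
    """Let the LLM reasoner iteratively plan what to observe next."""
    memory = []  # accumulated evidence fed back into the reasoner
    for _ in range(max_steps):
        # 1) Reason + plan: the LLM decides the next observation
        #    (which tool, which time span, how densely to sample),
        #    or decides it already has enough evidence to answer.
        plan = llm.plan_next_step(question, memory)
        if plan.final_answer is not None:
            return plan.final_answer
        obs: Observation = plan.observation

        # 2) Observe: a VLM-based tool is parameterized by the plan.
        frames = video.sample(obs.start_s, obs.end_s, obs.fps)
        evidence = vlm_tools[obs.tool](frames, plan.query)

        # 3) Fold the new evidence back into the reasoner's context.
        memory.append((obs, evidence))
    # Fallback: answer with whatever evidence has been gathered so far.
    return llm.answer(question, memory)
```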


Paper Structure

This paper contains 55 sections, 2 equations, 21 figures, 8 tables.

Figures (21)

  • Figure 1: Comparison of video understanding paradigms: (a) Single forward pass of a model with one-off context selection. (b) Retrieval-based video agent. (c) LensWalk, an agent framework that actively plans observations to serve its subsequent reasoning, explores iteratively, and self-regulates its observation budget. (d) By actively planning observations only over the video segments essential to its reasoning, LensWalk achieves both high accuracy and exceptional efficiency.
  • Figure 2: Illustration of LensWalk’s reasoning-scheduled active observation on a real trace. The LLM reasoner alternates Scan Search and Segment Focus to seek question-relevant evidence, initially misclassifies a mole as the queried water vole and records conflicting memories in the subject table (red text), then invokes Stitch Verify to inspect key segments, resolve the contradiction, and arrive at the correct answer.
  • Figure 3: Comparison of video understanding methods on accuracy and token efficiency. Gemini is abbreviated as Gem.
  • Figure 4: Tool-call behavior of LensWalk across model recipes, visualized following zhang2025deep. We group behavior into six strategy types and show their ratio (sector angle), average frame cost (dashed polygon), and accuracy (noted as score; sector radius), revealing adaptive allocation of more frames to harder queries and diverse exploration under uncertainty.
  • Figure 5: System prompt used for the LensWalk Reasoner.
  • ...and 16 more figures
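The abstract and Figure 1(c) emphasize that the agent specifies a temporal scope and a sampling density, and Figure 4 reports the resulting frame cost per strategy. The snippet below is a minimal, hypothetical helper showing how such an observation specification could be grounded in concrete frame indices; the function name and parameters are assumptions for illustration, not LensWalk's published implementation.

```python
# Hypothetical helper: turn a planned temporal scope and sampling density
# into concrete frame indices. An illustrative assumption about how an
# observation specification could be grounded in frames, not the paper's code.
def frames_for_observation(start_s: float, end_s: float, fps: float,
                           video_fps: float, num_frames: int) -> list[int]:
    """Return frame indices covering [start_s, end_s] at roughly `fps` samples/s."""
    if end_s <= start_s or fps <= 0:
        return []
    duration = end_s - start_s
    n_samples = max(1, int(round(duration * fps)))
    indices = []
    for i in range(n_samples):
        t = start_s + (i + 0.5) * duration / n_samples  # sample at bin centers
        idx = min(num_frames - 1, int(round(t * video_fps)))
        indices.append(idx)
    return sorted(set(indices))

# Example: a broad, sparse scan of the first five minutes versus a dense
# focus on a 10-second segment, for a 30 fps, 30-minute video.
scan = frames_for_observation(0, 300, fps=0.2, video_fps=30, num_frames=54000)
focus = frames_for_observation(125, 135, fps=4.0, video_fps=30, num_frames=54000)
```

The asymmetry between the two calls mirrors the behaviors the figures describe: few frames spread over a long span for cue finding, many frames over a short span for fact extraction and verification.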