VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking

Jingyang Lin, Jialian Wu, Jiang Liu, Ximeng Sun, Ze Wang, Xiaodong Yu, Jiebo Luo, Zicheng Liu, Emad Barsoum

Abstract

Video agentic models have advanced challenging video-language tasks. However, most agentic approaches still rely heavily on greedy parsing of densely sampled video frames, resulting in high computational cost. We present VideoSeek, a long-horizon video agent that leverages video logic flow to actively seek answer-critical evidence instead of exhaustively parsing the full video. This insight allows the model to use far fewer frames while maintaining, or even improving, its video understanding capability. VideoSeek operates in a think-act-observe loop with a toolkit designed for collecting multi-granular video observations. This design enables query-aware exploration over accumulated observations and supports practical video understanding and reasoning. Experiments on four challenging video understanding and reasoning benchmarks demonstrate that VideoSeek achieves strong accuracy while using far fewer frames than prior video agents and standalone LMMs. Notably, VideoSeek achieves an improvement of 10.2 absolute points on LVBench over its base model, GPT-5, while using 93% fewer frames. Further analysis highlights the significance of leveraging video logic flow, the importance of strong reasoning capability, and the complementary roles of the tools in the toolkit.

Paper Structure

This paper contains 17 sections, 4 equations, 11 figures, 5 tables, 1 algorithm.

Figures (11)

  • Figure 1: Overview of VideoSeek. Left: VideoSeek is a long-horizon video agent that actively seeks answer-critical evidence, guided by video logic flow. Given a query and a video, a thinking LLM reasons over accumulated observations, plans the next step, and selects a tool from the toolkit. The selected tool gathers new evidence from the video (icons in the figure distinguish viewed frames from unseen frames), which is fed back to the thinking LLM in a think–act–observe loop until sufficient evidence is collected to produce the final answer. Right: Accuracy vs. number of viewed frames on LVBench (wang2025lvbench). Distinct $\bullet$ markers denote video agentic models and standalone LMMs, and $\blacktriangle$ denotes VideoSeek (w/ subtitles), which achieves the best performance while processing only about $1/300$ as many frames as the second-best video agent.
  • Figure 2: Toolkit of the VideoSeek agent, including <overview>, <skim>, and <focus> tools. Left: <overview> rapidly scans the entire video to build a coarse storyline and highlight promising intervals. Middle: <skim> takes a quick glance at these candidate intervals (i.e., $t_1$ to $t_2$) at low cost to check whether query-relevant evidence is nearby. Right: <focus> zooms in on a fine-grained clip (i.e., $t_3$ to $t_4$) with dense inspection to obtain answer-critical observations. Red, blue, and green boxes denote frames viewed by <overview>, <skim>, and <focus>, respectively, while gray boxes indicate unseen frames.
  • Figure 3: Case study from LVBench (wang2025lvbench; uid: 1671) when applying the VideoSeek agent. The example illustrates how VideoSeek follows a think–act–observe loop: it reasons over accumulated observations, then actively invokes the <overview>, <skim>, and <focus> tools to inspect only the small subset of frames most relevant to the query.
  • Figure 4: Prompt for the initial user query. Blue text denotes variables.
  • Figure 5: Instruction at the beginning of each step. Blue text denotes variables.
  • ...and 6 more figures
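
The think–act–observe loop and the three tools described in Figures 1 and 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the per-call frame budgets in `BUDGET`, the helper names (`sample`, `run_tool`, `seek`), and `toy_policy`, a fixed coarse-to-fine plan standing in for the thinking LLM, are all assumptions made for illustration. The sketch only tracks which frame indices each tool would view, to show how few frames a seeking strategy touches compared with dense sampling.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    tool: str
    frames: list  # indices of frames viewed by this tool call


def sample(start, end, count):
    """Return up to `count` evenly spaced frame indices in [start, end)."""
    span = end - start
    if span <= 0 or count <= 0:
        return []
    count = min(count, span)
    return [start + (i * span) // count for i in range(count)]


# Hypothetical per-call frame budgets (assumed values, not from the paper).
BUDGET = {"overview": 8, "skim": 4, "focus": 16}


def run_tool(tool, start, end):
    """Act: view a budgeted number of frames in the requested interval."""
    return Observation(tool, sample(start, end, BUDGET[tool]))


def toy_policy(observations, total_frames):
    """Stand-in for the thinking LLM: a fixed coarse-to-fine plan."""
    step = len(observations)
    mid = total_frames // 2
    if step == 0:
        return ("overview", 0, total_frames)   # coarse scan of the whole video
    if step == 1:
        return ("skim", mid, total_frames)     # quick glance at a candidate interval
    if step == 2:
        return ("focus", mid, mid + 32)        # dense inspection of a short clip
    return None                                # enough evidence: stop and answer


def seek(total_frames, policy=toy_policy, max_steps=10):
    """Run the loop and return all observations plus the unique viewed frames."""
    observations = []
    for _ in range(max_steps):
        action = policy(observations, total_frames)   # think
        if action is None:
            break
        observations.append(run_tool(*action))        # act, then observe
    viewed = sorted({f for o in observations for f in o.frames})
    return observations, viewed


if __name__ == "__main__":
    obs, viewed = seek(10_000)
    print([o.tool for o in obs], len(viewed))
```

In a real agent, `toy_policy` would be replaced by a call to a reasoning model that reads the accumulated observations and chooses the next tool and interval; the loop structure itself would stay the same.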