VideoExplorer: Think With Videos For Agentic Long-Video Understanding

Huaying Yuan, Zheng Liu, Junjie Zhou, Hongjin Qian, Yan Shu, Nicu Sebe, Ji-Rong Wen, Zhicheng Dou

TL;DR

VideoExplorer tackles long-video understanding by enabling a thinking-with-video paradigm that integrates planning, temporal grounding, and scalable perception into a unified reasoning loop. It introduces a reasoning-centric, difficulty-adaptive dataset and a two-stage training pipeline (SFT followed by TDPO) to learn interpretable, multi-step reasoning traces. Empirical results on LVBench, MLVU, and MH-NIAH show consistent improvements over baselines, with better temporal grounding and robustness to complex tasks. The work advances scalable and interpretable LVU with efficient on-demand perception and provides public code to foster reproducibility.

Abstract

Long-video understanding (LVU) is a challenging problem in computer vision. Existing methods either downsample frames for single-pass reasoning, sacrificing fine-grained details, or depend on textual reasoning over task-agnostic representations, hindering task-specific perception and exploration. In this paper, we propose VideoExplorer, a framework grounded in the principle of "thinking with video", which naturally intertwines planning, temporal grounding, and scalable perception into a coherent reasoning process. Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding until reaching the final answer, enabling faithful, efficient, and interpretable reasoning. To address the lack of LVU training resources, we construct a long-video reasoning dataset using difficulty-adaptive sampling to ensure high-quality trajectories on complex tasks. Building on this dataset, we design a two-stage training pipeline: supervised trajectory initialization followed by trajectory-level preference optimization, encouraging adaptive temporal grounding and iterative information integration guided by downstream rewards. Extensive evaluations on popular long-video understanding and reasoning benchmarks demonstrate VideoExplorer's significant advantage over existing baselines, highlighting its robustness, adaptability, and efficiency. Our code is made publicly available in this repository (https://github.com/yhy-2000/VideoDeepResearch).
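
The iterative loop described in the abstract (formulate a sub-question, locate the relevant moment, perceive that segment, repeat until an answer is reached) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `explore`, `PlanResult`, `Step`, and the three toy callables are hypothetical names introduced here, and the stubs stand in for the planner, temporal grounder, and perception model that the paper would back with learned models.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec); an assumed representation


@dataclass
class Step:
    sub_question: str
    span: Span
    observation: str


@dataclass
class PlanResult:
    sub_question: Optional[str] = None  # next thing to look for in the video
    answer: Optional[str] = None        # set once the planner decides to stop


def explore(
    question: str,
    planner: Callable[[str, List[Step]], PlanResult],
    grounder: Callable[[str], Span],
    perceiver: Callable[[Span, str], str],
    max_steps: int = 8,
) -> Tuple[Optional[str], List[Step]]:
    """Run plan -> ground -> perceive until the planner returns an answer."""
    history: List[Step] = []
    for _ in range(max_steps):
        plan = planner(question, history)
        if plan.answer is not None:
            return plan.answer, history
        span = grounder(plan.sub_question)               # temporal grounding
        observation = perceiver(span, plan.sub_question) # on-demand perception
        history.append(Step(plan.sub_question, span, observation))
    return None, history  # step budget exhausted without an answer


# Toy stand-ins (hypothetical) so the loop can be run end to end.
def toy_planner(question: str, history: List[Step]) -> PlanResult:
    if not history:
        return PlanResult(sub_question="When does the chef enter?")
    if len(history) == 1:
        return PlanResult(sub_question="What does the chef pick up?")
    return PlanResult(answer=history[-1].observation)


def toy_grounder(sub_question: str) -> Span:
    return (120.0, 150.0) if "enter" in sub_question else (150.0, 160.0)


def toy_perceiver(span: Span, sub_question: str) -> str:
    return "a knife" if span == (150.0, 160.0) else "chef enters at 02:00"


answer, trace = explore(
    "What does the chef pick up after entering?",
    toy_planner, toy_grounder, toy_perceiver,
)
```

The returned `trace` keeps every sub-question, span, and observation, which is one simple way to expose the kind of interpretable reasoning trajectory the abstract refers to.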

Paper Structure

This paper contains 25 sections, 6 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Illustration of the VideoExplorer framework. Instead of reasoning over brute-force downsampled frames or a static preprocessed database, VideoExplorer integrates planning, temporal grounding, and scalable video perception into a unified reasoning paradigm: the planner decomposes complex tasks into sub-questions, the temporal grounder adaptively localizes relevant temporal spans, and the perception module dynamically adjusts granularity to meet task demands, enabling faithful, efficient, and interpretable long-video understanding.
  • Figure 2: Difficulty-adaptive dataset generation. Tasks are uniformly sampled, reasoning trajectories generated by VideoExplorer, and hard cases re-sampled by first-round accuracy. Only correct-answer trajectories are retained, yielding faithful reasoning trajectories and challenging training data.
  • Figure 3: Data statistics of VideoExplorer dataset.
  • Figure 4: Visual Token Usage Comparison.
  • Figure 5: Case Study of VideoExplorer on LVBench. VideoExplorer correctly decomposes the task, and progressively identifies the relevant video segment through iterative reasoning with temporal grounding. It then performs fine-grained dense perception on this segment to deliver an accurate answer. In contrast, baseline methods all fail on this type of multi-hop fine-grained reasoning task.
  • ...and 3 more figures
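
The difficulty-adaptive generation summarized in the Figure 2 caption (a uniform first round, re-sampling of hard cases by first-round accuracy, and retention of correct-answer trajectories only) might look roughly like the sketch below. The caption does not state the exact re-sampling rule, so two things here are assumptions: accuracy is grouped by task type, and the second-round budget is split in proportion to each type's error rate. All function names and the trajectory dict layout are hypothetical.

```python
from typing import Dict, List, Tuple


def first_round_accuracy(results: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Per-task-type accuracy from (task_type, is_correct) first-round results."""
    totals: Dict[str, int] = {}
    hits: Dict[str, int] = {}
    for task_type, ok in results:
        totals[task_type] = totals.get(task_type, 0) + 1
        hits[task_type] = hits.get(task_type, 0) + int(ok)
    return {t: hits[t] / totals[t] for t in totals}


def second_round_budget(accuracy: Dict[str, float], budget: int) -> Dict[str, int]:
    """Assumed rule: give harder task types (lower accuracy) more samples."""
    error = {t: 1.0 - a for t, a in accuracy.items()}
    total = sum(error.values())
    if total == 0:
        return {t: 0 for t in accuracy}  # nothing was hard; no re-sampling
    return {t: round(budget * e / total) for t, e in error.items()}


def keep_correct(trajectories: List[dict]) -> List[dict]:
    """Retain only trajectories whose final answer was correct."""
    return [t for t in trajectories if t["correct"]]


# Toy first round: 4 uniformly sampled tasks per type.
round_one = [
    ("counting", True), ("counting", True), ("counting", True), ("counting", False),
    ("multi_hop", True), ("multi_hop", False), ("multi_hop", False), ("multi_hop", False),
]
accuracy = first_round_accuracy(round_one)
extra = second_round_budget(accuracy, budget=100)
kept = keep_correct([
    {"task": "counting", "correct": True},
    {"task": "multi_hop", "correct": False},
])
```

In this toy run the harder `multi_hop` type receives three times the second-round budget of `counting`, while filtering keeps the retained trajectories limited to those that ended in a correct answer.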