LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Bo Li, Chengwei Qin, Shijian Lu, Xingxuan Li, Lidong Bing

TL;DR

LongVT introduces an end-to-end agentic framework for long-video reasoning that interleaves reasoning with on-demand temporal retrieval via a native crop_video tool, enabling global-to-local analysis and reducing hallucinations. Central to the approach is interleaved Multimodal Chain-of-Tool-Thought (iMCoTT), trained with a three-stage pipeline (cold-start SFT, agentic RL with a joint answer-temporal grounding reward, and agentic RFT) and coupled with VideoSIAH, a large-scale, fine-grained data suite for evidence-sparse long-video QA. The method achieves state-of-the-art performance among open-source LMMs on four long-video benchmarks, with VideoSIAH-Eval validating robust temporal grounding and evidence retrieval. The work also provides comprehensive data, methodological details, and analyses, offering practical insights for scalable, reliable long-video reasoning and highlighting the path toward human-aligned, tool-augmented multimodal intelligence.

Abstract

Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos (first skimming globally, then examining relevant clips for details), we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our code, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT.
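
The global-to-local loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: frames are represented only by their timestamps, the policy callable stands in for the LMM, and the names sample_timestamps, Step, think_with_video, and toy_policy are hypothetical. The cap of five turns and the crop_video(start_time, end_time) signature follow the figure captions; the extra duration and num_frames parameters are assumptions added for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

MAX_TURNS = 5  # the Figure 4 caption mentions up to T_5; used here as a hard cap


def sample_timestamps(start: float, end: float, num_frames: int) -> List[float]:
    """Uniformly spaced timestamps (seconds), one at the midpoint of each bin."""
    step = (end - start) / num_frames
    return [start + step * (i + 0.5) for i in range(num_frames)]


def crop_video(start_time: float, end_time: float,
               duration: float, num_frames: int = 8) -> List[float]:
    """Stand-in for the tool: clamp the window to the video and resample inside it.

    `duration` and `num_frames` are assumptions of this sketch, not documented
    tool parameters.
    """
    start = max(0.0, min(start_time, duration))
    end = max(start, min(end_time, duration))
    if end == start:
        raise ValueError("empty crop window")
    return sample_timestamps(start, end, num_frames)


@dataclass
class Step:
    """One model turn: either a final answer or a window to inspect next."""
    answer: Optional[str] = None
    window: Optional[Tuple[float, float]] = None


# Hypothetical policy interface: (current frame timestamps, turn index) -> Step
Policy = Callable[[List[float], int], Step]


def think_with_video(policy: Policy, duration: float,
                     num_frames: int = 8) -> Tuple[Optional[str], int]:
    """Skim globally, then zoom in until the policy answers or turns run out."""
    frames = sample_timestamps(0.0, duration, num_frames)  # coarse global preview
    turns_used = 0
    for turn in range(1, MAX_TURNS + 1):
        turns_used = turn
        step = policy(frames, turn)
        if step.answer is not None:
            return step.answer, turns_used
        if step.window is None:
            break  # neither an answer nor a tool call: stop
        frames = crop_video(*step.window, duration=duration, num_frames=num_frames)
    return None, turns_used


def toy_policy(frames: List[float], turn: int) -> Step:
    """Toy stand-in for the LMM: request one clip, then answer."""
    if turn == 1:
        return Step(window=(600.0, 660.0))
    return Step(answer="B")
```

With the toy policy, think_with_video(toy_policy, duration=3600.0) skims a one-hour video, requests the window from 600 s to 660 s, and answers on the second turn.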

Paper Structure

This paper contains 48 sections, 11 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: Interleaved Multimodal Chain-of-Tool-Thought (iMCoTT). Compared to prior text-based Chain-of-Thought (CoT) reasoning, iMCoTT in our proposed LongVT can natively perform self-reflection by calling the crop_video(start_time, end_time) tool. It proposes a time window after a global preview, proactively fetches the corresponding short clip, rethinks based on the new evidence, and determines whether to refine or answer directly. Such tool-augmented reasoning behaviors ground each step in what is actually seen rather than blindly rephrasing in text-only CoT, which mitigates hallucination and leads to enhanced temporal localization and answer correctness.
  • Figure 2: Data Pipeline of VideoSIAH. We construct a semi-automatic data pipeline that integrates several state-of-the-art LMMs (bai2025qwen25vl; openai2025o3; comanici2025gemini25; hong2025glm45v) to sequentially perform long video segmentation, video clip captioning, segment-in-a-haystack QA generation, cross-modal QA filtering, and iMCoTT generation. Icons with human silhouettes denote human-in-the-loop validation, where annotators inspect a small set of representative failures to refine prompting rules for QA generation, QA filtering, and iMCoTT generation. Note that iMCoTT traces are generated only for the cold-start SFT stage, whereas RL training operates solely on the filtered QA pairs.
  • Figure 3: Ablations on Reward Design. The left panel shows training dynamics under different accuracy and time rewards, and the right panel shows the effect of tool-call reward on tool usage.
  • Figure 4: Overall Framework of LongVT. Our approach processes long-form videos in a human-like two-stage manner. Specifically, LongVT is augmented with interleaved Multimodal Chain-of-Tool-Thought (iMCoTT): it first performs a global skim over sampled video frames to form a coarse hypothesis about when evidence likely occurs, then invokes a native video tool crop_video(start_time, end_time) to resample finer-grained frames from a short clip within the hypothesized window and reasons again. The model itself determines whether to answer directly after one turn ($T_1$) or continue for multiple turns (up to $T_5$) with self-reflection. During reinforcement learning, we jointly optimize answer correctness ($\textbf{R}_\text{acc}$), clean formatting ($\textbf{R}_\text{format}$), and precise temporal grounding ($\textbf{R}_\text{time}$).
  • Figure 5: Comparison of Watching Strategies Proposed by Gemini 2.5 Pro (comanici2025gemini25) and GPT-5 Thinking (openai2025gpt5). Best viewed when zoomed in.
  • ...and 9 more figures
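
The three reward terms named in the Figure 4 caption (answer correctness, formatting, temporal grounding) could be combined along the following lines. This is only a sketch under stated assumptions: the captions above do not give the reward formulas, so scoring temporal grounding with temporal IoU, treating correctness and formatting as binary signals, and taking a weighted sum with unit weights are all illustrative choices, and the names temporal_iou and joint_reward are hypothetical.

```python
from typing import Optional, Tuple

Window = Tuple[float, float]  # (start_time, end_time) in seconds


def temporal_iou(pred: Window, gt: Window) -> float:
    """Intersection-over-union of two time windows (0.0 when they do not overlap)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def joint_reward(answer_correct: bool, format_ok: bool,
                 pred_window: Optional[Window], gt_window: Window,
                 w_acc: float = 1.0, w_format: float = 1.0,
                 w_time: float = 1.0) -> float:
    """Assumed combination: weighted sum of R_acc, R_format, and R_time.

    The weights and the IoU-based R_time are assumptions of this sketch,
    not values taken from the paper.
    """
    r_acc = 1.0 if answer_correct else 0.0
    r_format = 1.0 if format_ok else 0.0
    # No tool call (no predicted window) earns no temporal-grounding reward.
    r_time = temporal_iou(pred_window, gt_window) if pred_window is not None else 0.0
    return w_acc * r_acc + w_format * r_format + w_time * r_time
```

Under these assumptions, a correct, well-formatted answer whose cropped window exactly matches the ground-truth window scores 3.0, while a well-formatted but wrong answer with no tool call scores 1.0.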