LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Bo Li, Chengwei Qin, Shijian Lu, Xingxuan Li, Lidong Bing
TL;DR
LongVT introduces an end-to-end agentic framework for long-video reasoning that interleaves reasoning with on-demand temporal retrieval via a native crop_video tool, enabling global-to-local analysis and reducing hallucinations. The model reasons through interleaved Multimodal Chain-of-Tool-Thought (iMCoTT) and is trained with a three-stage pipeline (cold-start SFT, agentic RL with a joint answer-temporal grounding reward, and agentic RFT) on VideoSIAH, a large-scale, fine-grained data suite for evidence-sparse long-video QA. The method achieves state-of-the-art performance among open-source LMMs on four long-video benchmarks, with VideoSIAH-Eval validating robust temporal grounding and evidence retrieval. The work also provides data, methodological details, and analyses that offer practical guidance for scalable, reliable, tool-augmented long-video reasoning.
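To make the idea of a joint answer-temporal grounding reward concrete, the sketch below combines an answer-correctness term with the temporal IoU between a predicted clip and a reference clip. This is only an illustration under stated assumptions: the function names, the additive form, and the weights `w_answer` and `w_time` are hypothetical, and the summary above does not specify the exact reward used by LongVT.

```python
from typing import Optional, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Span, gold: Span) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    pred_start, pred_end = pred
    gold_start, gold_end = gold
    intersection = max(0.0, min(pred_end, gold_end) - max(pred_start, gold_start))
    union = (pred_end - pred_start) + (gold_end - gold_start) - intersection
    return intersection / union if union > 0 else 0.0


def joint_reward(
    answer_correct: bool,
    pred_span: Optional[Span],
    gold_span: Span,
    w_answer: float = 1.0,  # hypothetical weight, not taken from the paper
    w_time: float = 1.0,    # hypothetical weight, not taken from the paper
) -> float:
    """Assumed additive reward: answer correctness plus temporal grounding quality."""
    reward = w_answer * (1.0 if answer_correct else 0.0)
    if pred_span is not None:  # no tool call means no grounding credit
        reward += w_time * temporal_iou(pred_span, gold_span)
    return reward
```

Under this assumed form, a rollout that answers correctly but crops the wrong segment earns less than one that answers correctly from the right segment, which is the behavior a joint reward is meant to encourage.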
Abstract
Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos (first skimming globally, then examining relevant clips for details), we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our code, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT.
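The global-to-local loop described in the abstract can be sketched as a small control loop: sample a few frames over the current window, let the model either answer or request a crop, then resample the same number of frames inside the narrower window. This is a minimal sketch, not the released implementation: `Step`, `sample_timestamps`, `think_with_long_video`, and `toy_policy` are hypothetical names, and the toy policy stands in for the LMM by looking for a known target timestamp.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Window = Tuple[float, float]  # (start_seconds, end_seconds)


@dataclass
class Step:
    """One model turn: either a final answer or a crop request (hypothetical type)."""
    answer: Optional[str] = None
    crop: Optional[Window] = None


def sample_timestamps(start: float, end: float, num_frames: int) -> List[float]:
    """Uniform timestamps at the centers of num_frames equal bins in [start, end]."""
    step = (end - start) / num_frames
    return [start + step * (i + 0.5) for i in range(num_frames)]


def think_with_long_video(
    policy: Callable[[List[float], Window], Step],
    duration: float,
    num_frames: int = 8,
    max_turns: int = 5,
) -> Tuple[Optional[str], Window]:
    """Skim globally, then repeatedly zoom in until the policy answers."""
    window: Window = (0.0, duration)
    for _ in range(max_turns):
        frames = sample_timestamps(window[0], window[1], num_frames)
        step = policy(frames, window)
        if step.answer is not None:
            return step.answer, window
        if step.crop is None:
            break
        # Clamp the requested crop to the video; the same frame budget over a
        # shorter window yields finer temporal resolution.
        start = max(0.0, step.crop[0])
        end = min(duration, step.crop[1])
        if end <= start:
            break
        window = (start, end)
    return None, window


def toy_policy(frames: List[float], window: Window,
               target: float = 1234.0, tolerance: float = 5.0) -> Step:
    """Stand-in for the LMM: answers once a sampled frame lands near the target."""
    nearest = min(frames, key=lambda t: abs(t - target))
    if abs(nearest - target) <= tolerance:
        return Step(answer=f"event near {nearest:.0f}s")
    half = (window[1] - window[0]) / 8
    return Step(crop=(nearest - half, nearest + half))


answer, final_window = think_with_long_video(toy_policy, duration=3600.0)
```

In this toy run, each crop shrinks the window by a factor of four, so a fixed budget of eight frames moves from coarse spacing over the full hour to spacing of a few seconds around the evidence.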
