
VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos

Wenqi Liu, Yunxiao Wang, Shijie Ma, Meng Liu, Qile Su, Tianke Zhang, Haonan Fan, Changyi Liu, Kaiyu Jiang, Jiankang Chen, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Yinwei Wei, Xuemeng Song

TL;DR

VideoTemp-o3 addresses long-video understanding by unifying temporal grounding with VideoQA in a single agentic model that can clip on demand and iteratively refine its grounding. It introduces a cold-start SFT stage with a unified masking strategy and a penalty-aware RL framework (GRPO) with dedicated rewards that improve grounding accuracy while reducing reward hacking. A data construction pipeline and VideoTemp-Bench are proposed to train and evaluate models on long-video grounded QA across diverse durations. Empirical results show state-of-the-art performance on multiple long-video QA and grounding benchmarks, supporting the effectiveness of on-demand clipping and iterative grounding for video understanding. The work lays a foundation for scalable, interpretable thinking-with-videos and suggests avenues for integrating additional external tools in practical settings.

Abstract

In long-video understanding, conventional uniform frame sampling often fails to capture key visual evidence, leading to degraded performance and increased hallucinations. To address this, recent agentic thinking-with-videos paradigms have emerged, adopting a localize-clip-answer pipeline in which the model actively identifies relevant video segments, performs dense sampling within those clips, and then produces answers. However, existing methods remain inefficient, suffer from weak localization, and adhere to rigid workflows. To solve these issues, we propose VideoTemp-o3, a unified agentic thinking-with-videos framework that jointly models video grounding and question answering. VideoTemp-o3 exhibits strong localization capability, supports on-demand clipping, and can refine inaccurate localizations. Specifically, in the supervised fine-tuning stage, we design a unified masking mechanism that encourages exploration while preventing noise. For reinforcement learning, we introduce dedicated rewards to mitigate reward hacking. Besides, from the data perspective, we develop an effective pipeline to construct high-quality long video grounded QA data, along with a corresponding benchmark for systematic evaluation across various video durations. Experimental results demonstrate that our method achieves remarkable performance on both long video understanding and grounding.
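
The contrast the abstract draws between uniform sampling and dense sampling inside a localized clip can be illustrated with a small sketch. Everything in it is an assumption made for illustration and does not come from the paper: the function names, the 32-frame budget, the one-hour video, and the timestamps of the evidence window and the clip.

```python
def uniform_timestamps(start: float, end: float, num_frames: int) -> list[float]:
    """Return num_frames timestamps (in seconds), one at the center of each
    equal-width bin of the interval [start, end]."""
    if num_frames <= 0 or end <= start:
        raise ValueError("need num_frames > 0 and end > start")
    step = (end - start) / num_frames
    return [start + step * (i + 0.5) for i in range(num_frames)]


def frames_inside(timestamps: list[float], seg_start: float, seg_end: float) -> int:
    """Count how many sampled timestamps fall inside [seg_start, seg_end]."""
    return sum(seg_start <= t <= seg_end for t in timestamps)


# Hypothetical setup: a one-hour video whose key evidence lasts 10 seconds.
video_length = 3600.0
evidence = (1800.0, 1810.0)
budget = 32  # assumed frame budget per model call

# Uniform sampling over the whole video: frames are about 112 s apart.
global_ts = uniform_timestamps(0.0, video_length, budget)

# Dense sampling inside a (hypothetical) localized clip around the evidence.
clip_ts = uniform_timestamps(1790.0, 1820.0, budget)

print(frames_inside(global_ts, *evidence))  # frames that see the evidence
print(frames_inside(clip_ts, *evidence))
```

With the same frame budget, the whole-video samples can miss a short evidence window entirely, while the clip-level samples cover it densely. That is the motivation for the localize-clip-answer pipeline described above.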

Paper Structure

This paper contains 37 sections, 6 equations, 21 figures, and 6 tables.

Figures (21)

  • Figure 1: Illustration of the agentic pipeline in VideoTemp-o3. Given the video QA pair, it performs on-demand grounding and refines the initial rough segment. Finally, it produces a reliable answer grounded in the pertinent visual evidence.
  • Figure 2: Multi-turn, multi-tool call data curation pipeline.
  • Figure 3: Training Data Distribution.
  • Figure 4: The unified masking mechanism, where only the last two turns of responses are supervised while others are masked.
  • Figure 5: Reward hacking with native IoU rewards.
  • ...and 16 more figures
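
The Figure 4 caption describes the unified masking mechanism concretely enough to sketch: only the last two response turns contribute to the training loss. The following is a minimal illustration, not the authors' implementation. It assumes a conversation represented as per-turn roles and token counts, and it reads "responses" as turns with the role `assistant`; the function name and the example conversation are hypothetical.

```python
from typing import Sequence


def last_k_response_mask(
    roles: Sequence[str], lengths: Sequence[int], k: int = 2
) -> list[int]:
    """Build a per-token loss mask (1 = supervised, 0 = masked).

    roles[i] is the role of turn i ("user", "assistant", or "tool") and
    lengths[i] is its token count. Only the last k assistant turns are
    supervised; every other token is masked out of the loss.
    """
    if len(roles) != len(lengths):
        raise ValueError("roles and lengths must have the same length")
    response_turns = [i for i, role in enumerate(roles) if role == "assistant"]
    # Guard k <= 0 explicitly: response_turns[-0:] would select every turn.
    supervised = set(response_turns[-k:]) if k > 0 else set()
    mask: list[int] = []
    for i, n_tokens in enumerate(lengths):
        mask.extend([1 if i in supervised else 0] * n_tokens)
    return mask


# Hypothetical three-round trajectory: question, rough grounding,
# clip result, refined grounding, clip result, final answer.
roles = ["user", "assistant", "tool", "assistant", "tool", "assistant"]
lengths = [4, 3, 2, 3, 2, 2]

mask = last_k_response_mask(roles, lengths)
print(mask)
```

In this example the first assistant turn (the rough grounding) is masked along with the user and tool turns, so only the refined grounding and the final answer are supervised.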