
Think with Grounding: Curriculum Reinforced Reasoning with Video Grounding for Long Video Understanding

Houlun Chen, Xin Wang, Guangyao Li, Yuwei Zhou, Yihan Chen, Jia Jia, Wenwu Zhu

TL;DR

Video-TwG presents think-with-grounding, a curriculum-reinforced framework for long video understanding in which a video LLM decides when to ground question-relevant clips during multi-turn reasoning. By introducing a two-stage reinforced curriculum and the TwG-GRPO training algorithm, it improves QA accuracy while reducing unnecessary grounding actions, without relying on heavily annotated reasoning traces. The TwG-51K dataset supports training on both grounding-labeled and unlabeled data, encouraging generalization across diverse long-video benchmarks. Empirical results on Video-MME, LongVideoBench, and MLVU show consistent gains over strong baselines, highlighting the practical value of on-demand grounding and selective zooming for long-form video reasoning.

Abstract

Long video understanding is challenging because of the rich and complicated multimodal clues spread over a long temporal range. Current methods adopt text-form reasoning to improve the model's ability to analyze complex clues in long videos. However, text-only reasoning under a fixed video context may exacerbate hallucinations, since crucial details are often missed under a limited video context length due to the temporal redundancy of long videos. To address this gap, we propose Video-TwG, a curriculum-reinforced framework that employs a novel Think-with-Grounding paradigm, enabling video LLMs to actively decide when to perform on-demand grounding during interleaved text-video reasoning, selectively zooming into question-relevant clips only when necessary. Video-TwG can be trained end-to-end in a straightforward manner, without relying on complex auxiliary modules or heavily annotated reasoning traces. In detail, we design a Two-stage Reinforced Curriculum Strategy, where the model first learns think-with-grounding behavior on a small short-video GQA dataset with grounding labels, and then scales to diverse general QA data with videos from diverse domains to encourage generalization. Further, to handle complex think-with-grounding reasoning for various kinds of data, we propose the TwG-GRPO algorithm, which features a fine-grained grounding reward, a self-confirmed pseudo reward, and an accuracy-gated mechanism. Finally, we construct a new TwG-51K dataset that facilitates training. Experiments on Video-MME, LongVideoBench, and MLVU show that Video-TwG consistently outperforms strong LVU baselines. Further ablation validates the necessity of our Two-stage Reinforced Curriculum Strategy and shows that TwG-GRPO better leverages diverse unlabeled data to improve grounding quality and reduce redundant groundings without sacrificing QA performance.
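The abstract names the ingredients of TwG-GRPO but does not give formulas, so the following is only a rough sketch of how a grounding reward, an accuracy gate, and GRPO-style group-relative advantages could fit together. Everything here is an assumption made for illustration: temporal IoU as the grounding score, the gating rule (grounding reward counted only when the answer is correct), the weight `w_ground`, and the function names. The self-confirmed pseudo reward is not modeled. Only the group-normalized advantage follows the general GRPO recipe.

```python
import statistics
from typing import List, Optional, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two time intervals (assumed grounding score)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def gated_reward(
    answer_correct: bool,
    pred_span: Optional[Span],
    gt_span: Optional[Span],
    w_ground: float = 0.5,  # hypothetical weight, not from the paper
) -> float:
    """Accuracy reward plus a grounding bonus that is gated on correctness.

    Assumption: the grounding term is added only when the answer is correct
    and both a predicted and a labeled span are available.
    """
    acc = 1.0 if answer_correct else 0.0
    if not answer_correct or pred_span is None or gt_span is None:
        return acc
    return acc + w_ground * temporal_iou(pred_span, gt_span)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: normalize each reward within its rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts for one question with a labeled span of 10-20 s.
rewards = [
    gated_reward(True, (10.0, 20.0), (10.0, 20.0)),   # correct, perfect span
    gated_reward(True, (15.0, 25.0), (10.0, 20.0)),   # correct, partial overlap
    gated_reward(True, None, (10.0, 20.0)),           # correct, no grounding
    gated_reward(False, (10.0, 20.0), (10.0, 20.0)),  # wrong answer, gate closed
]
advantages = group_relative_advantages(rewards)
```

Under this gating assumption, a rollout that grounds the right clip but answers incorrectly earns nothing from grounding, so the policy is not rewarded for grounding that fails to help the answer.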

Paper Structure

This paper contains 19 sections, 10 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: Conceptual comparison between our Video-TwG and existing video reasoning models. The frame marked with a blue triangle contains the detail that is important for the question. Existing reasoning models reach the wrong answer because they reason over a video context that misses this detail, whereas Video-TwG uses grounding during reasoning to dynamically perceive important clues.
  • Figure 2: The trajectory of think-with-grounding. Initially, the coarse-grained video and the question are given. In each turn, the model produces a thinking process and an action based on the interaction history. If the action is grounding, the grounded video clip is zoomed in with a fine-grained representation and added to the context. If the action is answering, the reasoning process stops.
  • Figure 3: The illustration of our proposed TwG-GRPO algorithm.
  • Figure 4: Ablation results on Video-MME of our models with different training stages.
  • Figure 5: Training curves of several variants of our Video-TwG in stage 1. The curves are smoothed for visual clarity; the same applies to the figures below.
  • ...and 2 more figures
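
The multi-turn trajectory described in the Figure 2 caption can be sketched as a simple loop. This is a minimal illustration rather than the authors' implementation: the `Ground`/`Answer` action types, the `Context` container, the sampling rates `coarse_fps` and `fine_fps`, the `max_turns` budget, and the toy policy are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple, Union


@dataclass
class Ground:
    start: float  # seconds
    end: float    # seconds


@dataclass
class Answer:
    text: str


Action = Union[Ground, Answer]


@dataclass
class Context:
    question: str
    # Each entry is (start_sec, end_sec, fps) for a video segment in context.
    segments: List[Tuple[float, float, float]] = field(default_factory=list)
    thoughts: List[str] = field(default_factory=list)


def think_with_grounding(
    policy: Callable[[Context], Tuple[str, Action]],
    question: str,
    duration: float,
    coarse_fps: float = 0.5,  # assumed coarse sampling rate
    fine_fps: float = 2.0,    # assumed fine sampling rate for zoomed clips
    max_turns: int = 4,       # assumed turn budget
) -> Tuple[Optional[str], Context]:
    """Run the think/act loop until the policy answers or the budget runs out."""
    # Turn 0: the whole video at coarse granularity plus the question.
    ctx = Context(question=question, segments=[(0.0, duration, coarse_fps)])
    for _ in range(max_turns):
        thought, action = policy(ctx)
        ctx.thoughts.append(thought)
        if isinstance(action, Answer):
            return action.text, ctx  # answering stops the reasoning process
        # Grounding: clamp the clip to the video and add it at fine granularity.
        start = max(0.0, min(action.start, duration))
        end = max(start, min(action.end, duration))
        ctx.segments.append((start, end, fine_fps))
    return None, ctx  # no answer produced within the turn budget


def toy_policy(ctx: Context) -> Tuple[str, Action]:
    """Stand-in for a video LLM: ground once, then answer."""
    if len(ctx.segments) == 1:
        return "I need a closer look at 120-150 s.", Ground(120.0, 150.0)
    return "The zoomed clip shows the answer.", Answer("B")


answer, ctx = think_with_grounding(toy_policy, "What does the chef add last?", 600.0)
```

In this sketch the context only records which segments are visible and at what rate; a real system would decode frames for each segment and feed them to the model together with the accumulated thoughts.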