Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning

Shijian Wang, Jiarui Jin, Xingjian Wang, Linxin Song, Runhao Fu, Hecheng Wang, Zongyuan Ge, Yuan Lu, Xuelian Cheng

TL;DR

The paper extends the "Thinking with Images" paradigm to video reasoning by having multimodal LLMs use their intrinsic grounding and captioning capabilities during the reasoning process. It introduces Video-Thinker and the Video-Thinker-10K dataset, and trains in two stages (SFT, then GRPO) so that the model learns structured, temporally grounded reasoning traces without external tools. Empirical results show state-of-the-art performance among 7B-sized MLLMs on in-domain tasks and on challenging out-of-domain benchmarks (Video-Holmes, CG-Bench-Reasoning, VRBench), with strong gains from grounding and captioning and a data-efficient training regime. The work demonstrates robust temporal reasoning and suggests future scaling to larger models and extension to modalities beyond video.

Abstract

Recent advances in image reasoning methods, particularly "Thinking with Images", have demonstrated remarkable success in Multimodal Large Language Models (MLLMs); however, this dynamic reasoning paradigm has not yet been extended to video reasoning tasks. In this paper, we propose Video-Thinker, which empowers MLLMs to think with videos by autonomously leveraging their intrinsic "grounding" and "captioning" capabilities to generate reasoning clues throughout the inference process. To spark this capability, we construct Video-Thinker-10K, a curated dataset featuring autonomous tool usage within chain-of-thought reasoning sequences. Our training strategy begins with Supervised Fine-Tuning (SFT) to learn the reasoning format, followed by Group Relative Policy Optimization (GRPO) to strengthen this reasoning capability. Through this approach, Video-Thinker enables MLLMs to autonomously navigate grounding and captioning tasks for video reasoning, eliminating the need for constructing and calling external tools. Extensive experiments demonstrate that Video-Thinker achieves significant performance gains on both in-domain tasks and challenging out-of-domain video reasoning benchmarks, including Video-Holmes, CG-Bench-Reasoning, and VRBench. Our Video-Thinker-7B substantially outperforms existing baselines such as Video-R1 and establishes state-of-the-art performance among 7B-sized MLLMs.
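
The GRPO stage mentioned in the abstract relies on comparing several sampled responses to the same prompt. As a minimal sketch of that general idea (not the authors' implementation), the snippet below normalizes rewards within one group to obtain per-response advantages. The function name `group_relative_advantages`, the `eps` constant, and the example reward values are assumptions made for illustration; the paper's actual reward design is not described in this section.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), which is the
    usual group-relative normalization in GRPO-style training. The eps term
    (an assumption here) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one video question
# (for example, 1.0 for a correct final answer, 0.0 otherwise).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged; if every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.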

Paper Structure

This paper contains 24 sections, 8 equations, 14 figures, 6 tables, 1 algorithm.

Figures (14)

  • Figure 1: Overall Performance of Video-Thinker
  • Figure 2: Video-Thinker integrates "grounding" and "captioning" capabilities throughout the reasoning process using end-to-end reinforcement learning.
  • Figure 3: Data synthesis pipeline of Video-Thinker-10K; the data distribution is depicted in Figure 5 in the appendix.
  • Figure 4: An example of Video-Thinker-7B's reasoning output on the CG-Bench-Reasoning dataset.
  • Figure 5: The data distribution of our Video-Thinker-10K dataset.
  • ...and 9 more figures