ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis

Congzhi Zhang, Zhibin Wang, Yinchao Ma, Jiawei Peng, Yihan Wang, Qiang Zhou, Jun Song, Bo Zheng

TL;DR

This paper tackles the data bottleneck hindering complex video reasoning in LVLMs by introducing ReWatch, a large-scale, temporally dense dataset built via a three-stage synthesis pipeline (Captions, QA, CoT). It pairs this with ReWatch-R1, a post-trained LVLM augmented by supervised fine-tuning and reinforcement learning with an Observation & Reasoning (O&R) reward that enforces both final-answer accuracy and grounding of intermediate reasoning in video content. The approach yields state-of-the-art performance across five challenging video reasoning benchmarks and demonstrates that high-quality, video-grounded CoT and contrastive QA data significantly improve robustness and factuality while mitigating hallucinations. The results highlight the practical potential of agentic data synthesis and process-oriented RL for scalable, reliable temporal reasoning in LVLMs. The work offers a scalable paradigm for building video-understanding systems that reason across long temporal contexts with verifiable, evidence-linked traces.

Abstract

While Reinforcement Learning with Verifiable Reward (RLVR) significantly advances image reasoning in Large Vision-Language Models (LVLMs), its application to complex video reasoning remains underdeveloped. This gap stems primarily from a critical data bottleneck: existing datasets lack the challenging, multi-hop questions and high-quality, video-grounded Chain-of-Thought (CoT) data necessary to effectively bootstrap RLVR. To address this, we introduce ReWatch, a large-scale dataset built to foster advanced video reasoning. We propose a novel multi-stage synthesis pipeline to synthesize its three components: ReWatch-Caption, ReWatch-QA, and ReWatch-CoT. A core innovation is our Multi-Agent ReAct framework for CoT synthesis, which simulates a human-like "re-watching" process to generate video-grounded reasoning traces by explicitly modeling information retrieval and verification. Building on this dataset, we develop ReWatch-R1 by post-training a strong baseline LVLM with Supervised Fine-Tuning (SFT) and our RLVR framework. This framework incorporates a novel Observation & Reasoning (O&R) reward mechanism that evaluates both the final answer's correctness and the reasoning's alignment with video content, directly penalizing hallucination. Our experiments show that ReWatch-R1 achieves state-of-the-art average performance on five challenging video reasoning benchmarks. Project Page: https://rewatch-r1.github.io
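
To make the idea of a reward that scores both the final answer and the grounding of intermediate observations more concrete, here is a minimal sketch. It is not the paper's O&R reward: the abstract does not give its formula, so the function names, the token-overlap grounding measure, and the equal weights below are all assumptions chosen purely for illustration.

```python
import re
from typing import List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def grounding_score(observations: List[str], caption: str) -> float:
    """Hypothetical grounding measure: the mean fraction of each observation's
    tokens that also occur in a reference caption of the video."""
    if not observations:
        return 0.0
    caption_tokens = _tokens(caption)
    scores = []
    for obs in observations:
        obs_tokens = _tokens(obs)
        if not obs_tokens:
            scores.append(0.0)
            continue
        scores.append(len(obs_tokens & caption_tokens) / len(obs_tokens))
    return sum(scores) / len(scores)


def combined_reward(
    predicted: str,
    gold: str,
    observations: List[str],
    caption: str,
    w_answer: float = 0.5,      # assumed weight, not from the paper
    w_grounding: float = 0.5,   # assumed weight, not from the paper
) -> float:
    """Weighted sum of answer correctness (exact match) and grounding."""
    answer = 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0
    return w_answer * answer + w_grounding * grounding_score(observations, caption)


caption = "A man opens a red door and walks inside"
grounded = combined_reward("B", "b", ["a man opens a red door"], caption)
hallucinated = combined_reward("B", "b", ["a dog jumps"], caption)
```

Under these assumptions, two rollouts with the same correct answer receive different rewards: the one whose stated observation matches the caption scores higher than the one describing content that is not in the video, which is the behavior the abstract attributes to penalizing hallucination.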

Paper Structure

This paper contains 40 sections, 15 equations, 19 figures, 4 tables.

Figures (19)

  • Figure 1: Performance comparison of our ReWatch-R1 with previous state-of-the-art LVLMs on five video reasoning benchmarks. Except for Qwen2.5-VL-7B, all other models use thinking mode. All models were evaluated at 192 frames.
  • Figure 2: A comparison of the ReWatch dataset and the Video-R1 dataset on the same source video.
  • Figure 2: Statistics of our dataset.
  • Figure 3: The data construction pipeline. (a) Caption Construction. Long videos are semantically segmented to produce detailed, temporally-aware captions. (b) QA Pair Generation. A contrastive method using detailed and summary captions generates complex questions, which are then purified by a three-layer filtering mechanism. (c) CoT Synthesis. A ReAct framework with a Reasoner Agent and an Observer Agent simulates a "re-watching" process by performing targeted queries on the video caption to generate video-grounded reasoning traces.
  • Figure 4: Our two-stage Post-Training framework. (a) A Base Model is first fine-tuned (SFT) on all ReWatch datasets, (b) then further refined as a policy via Reinforcement Learning (RL) using the ReWatch-QA dataset. (c) The "Rollout" panel illustrates the generative process of the policy: producing a purely textual chain-of-thought that simulates a Thought-Action-Observation reasoning loop through self-generated text segments. (d) We employ four verifiable reward mechanisms.
  • ...and 14 more figures