Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

Yolo Y. Tang, Daiki Shimada, Hang Hua, Chao Huang, Jing Bi, Rogerio Feris, Chenliang Xu

TL;DR

This work tackles text-rich video understanding by addressing the hallucinations and brittle reasoning that arise from single-pass perception. It introduces Video-R4, a visual-rumination framework that iteratively selects frames, zooms into regions, re-encodes pixels, and updates its reasoning state in a closed-loop read–retrieve–refocus–reinforce cycle. To train such a system, the authors curate two datasets, Video-R4-CoT-17k for supervised deliberate practice and Video-R4-RL-30k for reinforcement learning, and implement a four-stage curriculum (DRP-SFT, RL_d, CRP-SFT, RL_c) built on GRPO with a dedicated reward design (Diversity, Representativeness, Curiosity). Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and generalizes to multi-page documents, slides, and general video QA, which points to the broad applicability of iterative pixel-grounded reasoning. The results suggest that equipping LMMs with explicit visual tools and multi-step grounding can lead to more reliable and interpretable multimodal reasoning in complex, text-rich scenarios.
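The RL stages are described as building on GRPO. As a rough illustration only, the sketch below shows the group-relative normalization that GRPO is generally known for: several rollouts for the same prompt are scored, and each reward is standardized against its own group. The function name, the `eps` term, the use of the population standard deviation, and the example reward values are all assumptions made for illustration; the paper's Diversity, Representativeness, and Curiosity rewards are not reproduced here.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize each rollout's reward against the other rollouts in its group.

    Hypothetical helper: `eps` guards against division by zero when every
    rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up scalar rewards for four rollouts answering the same question.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Rollouts that score above the group mean receive a positive advantage and those below receive a negative one, so no separate value model is needed to provide a baseline.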

Abstract

Understanding text-rich videos requires reading small, transient textual cues that often demand repeated inspection. Yet most video QA models rely on single-pass perception over fixed frames, leading to hallucinations and failures on fine-grained evidence. Inspired by how humans pause, zoom, and re-read critical regions, we introduce Video-R4 (Reinforcing Text-Rich Video Reasoning with Visual Rumination), a video reasoning LMM that performs visual rumination: iteratively selecting frames, zooming into informative regions, re-encoding retrieved pixels, and updating its reasoning state. We construct two datasets with executable rumination trajectories: Video-R4-CoT-17k for supervised practice and Video-R4-RL-30k for reinforcement learning. We propose a multi-stage rumination learning framework that progressively finetunes a 7B LMM to learn atomic and mixing visual operations via SFT and GRPO-based RL. Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and further generalizes to multi-page document QA, slides QA, and generic video QA, demonstrating that iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning. Project Page: https://yunlong10.github.io/Video-R4/
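The rumination loop described in the abstract (select frames, zoom into regions, re-encode the retrieved pixels, update the reasoning state) can be pictured with a toy control loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the `State` class, the `("select" | "zoom" | "answer", ...)` action tuples, the `crop` helper, and the scripted policy are all hypothetical, and "re-encoding" is reduced to appending raw pixels to an evidence list.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Frame = List[List[int]]  # toy grayscale frame: rows of pixel values


@dataclass
class State:
    question: str
    evidence: List[Frame] = field(default_factory=list)  # pixels gathered so far


def crop(frame: Frame, box: Tuple[int, int, int, int]) -> Frame:
    """Zoom: return the region (x0, y0, x1, y1), with x1 and y1 exclusive."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in frame[y0:y1]]


def ruminate(video: List[Frame], question: str,
             policy: Callable[[State], tuple], max_steps: int = 8) -> Optional[str]:
    """Run the closed loop until the policy answers or the step budget runs out."""
    state = State(question)
    for _ in range(max_steps):
        action = policy(state)
        if action[0] == "answer":      # ("answer", text)
            return action[1]
        if action[0] == "select":      # ("select", frame_index)
            state.evidence.append(video[action[1]])
        elif action[0] == "zoom":      # ("zoom", frame_index, box)
            state.evidence.append(crop(video[action[1]], action[2]))
    return None  # budget exhausted without an answer


def scripted_policy(state: State) -> tuple:
    """Stand-in for the LMM: pick frame 1, zoom into it, then answer."""
    if len(state.evidence) == 0:
        return ("select", 1)
    if len(state.evidence) == 1:
        return ("zoom", 1, (1, 0, 3, 2))
    return ("answer", str(state.evidence[-1]))


video = [[[0] * 4 for _ in range(3)],
         [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]]
answer = ruminate(video, "What text appears in the corner?", scripted_policy)
```

In a real system the policy would be the model itself, choosing the next operation from its current reasoning state; the scripted policy here only makes the control flow visible.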

Paper Structure

This paper contains 46 sections, 9 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Video-R4 performs iterative visual rumination by selecting frames, zooming into regions, and re-encoding pixels, forming a closed-loop read–retrieve–refocus–reinforce cycle for grounded video reasoning.
  • Figure 2: Our Video-R4-7B model achieves state-of-the-art performance on the text-rich video understanding dataset M4-ViteVQA, and is also comparable to LMMs of the same size on general video QA benchmarks.
  • Figure 3: Data curation pipeline for creating the Video-R4-CoT-17k for supervised deliberate rumination practice fine-tuning (DRP-SFT) and compositional rumination practice fine-tuning (CRP-SFT), as well as the Video-R4-RL-30k dataset for reinforcement learning. The light blue parts are intended to be used as the model’s inputs, while the pink parts are expected to be produced by the model as outputs.
  • Figure 4: Overview of multi-stage rumination training framework.
  • Figure 5: Overall statistics of the Video-R4-CoT-17k dataset, including the ratio of video versus image samples, word cloud of frequently appearing terms, question length distribution, distribution of visual operation counts per sample, and conversation turn count distribution.
  • ...and 5 more figures