Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination
Yolo Y. Tang, Daiki Shimada, Hang Hua, Chao Huang, Jing Bi, Rogerio Feris, Chenliang Xu
TL;DR
This work tackles text-rich video understanding, where single-pass perception leads to hallucinations and brittle reasoning. It introduces Video-R4, a visual-rumination framework that iteratively selects frames, zooms into regions, re-encodes pixels, and updates its reasoning state in a closed-loop read–retrieve–refocus–reinforce cycle. To train such a system, the authors curate two datasets, Video-R4-CoT-17k for supervised deliberate practice and Video-R4-RL-30k for reinforcement learning, and implement a four-stage curriculum (DRP-SFT, RL_d, CRP-SFT, RL_c) built on GRPO with a reward design covering Diversity, Representativeness, and Curiosity. Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and generalizes robustly to multi-page documents, slides, and general video QA, highlighting the broad applicability of iterative pixel-grounded reasoning. The results suggest that equipping LMMs with explicit visual tools and multi-step grounding can lead to more reliable and interpretable multimodal reasoning in complex, text-rich scenarios.
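Since the curriculum builds on GRPO, a brief illustration of its core idea may help: each sampled response is scored relative to the other responses drawn for the same prompt, so no separate value model is needed. The sketch below shows only that group-relative normalization. The function name, the `eps` constant, and the choice of population standard deviation are assumptions for illustration; how Video-R4 computes or combines its Diversity, Representativeness, and Curiosity rewards is not specified here and is not modeled.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize the rewards of one sampled group into advantages.

    `rewards` holds one scalar reward per response sampled for the same
    prompt. Each advantage is the reward's deviation from the group mean,
    divided by the group standard deviation. `eps` (an assumed constant)
    avoids division by zero when every response gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical rollouts for one question: two rewarded, two not.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Responses that beat the group average receive positive advantages and the rest negative ones; when all rewards in a group are equal, every advantage is zero and that group contributes no learning signal.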
Abstract
Understanding text-rich videos requires reading small, transient textual cues that often demand repeated inspection. Yet most video QA models rely on single-pass perception over fixed frames, leading to hallucinations and failures on fine-grained evidence. Inspired by how humans pause, zoom, and re-read critical regions, we introduce Video-R4 (Reinforcing Text-Rich Video Reasoning with Visual Rumination), a video reasoning LMM that performs visual rumination: iteratively selecting frames, zooming into informative regions, re-encoding retrieved pixels, and updating its reasoning state. We construct two datasets with executable rumination trajectories: Video-R4-CoT-17k for supervised practice and Video-R4-RL-30k for reinforcement learning. We propose a multi-stage rumination learning framework that progressively finetunes a 7B LMM to learn atomic and mixed visual operations via SFT and GRPO-based RL. Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and further generalizes to multi-page document QA, slides QA, and generic video QA, demonstrating that iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning. Project Page: https://yunlong10.github.io/Video-R4/
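The rumination cycle described above (select frames, zoom into a region, re-encode the retrieved pixels, update the reasoning state) can be pictured as a simple control loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: `Action`, `State`, `crop`, `ruminate`, and the `policy` callable are hypothetical names, frames are toy 2D lists of pixel values, and appending pixels to `state.evidence` merely stands in for re-encoding them with a vision encoder.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Frame = List[List[int]]  # toy grayscale frame: rows of pixel values


@dataclass
class Action:
    kind: str                                         # "select", "zoom", or "answer"
    frame_ids: Tuple[int, ...] = ()                   # frames to retrieve
    box: Optional[Tuple[int, int, int, int]] = None   # (top, left, bottom, right)
    answer: str = ""


@dataclass
class State:
    question: str
    evidence: List[Frame] = field(default_factory=list)  # retrieved pixels so far
    trace: List[Action] = field(default_factory=list)    # rumination trajectory


def crop(frame: Frame, box: Tuple[int, int, int, int]) -> Frame:
    """Zoom into a rectangular region of one frame."""
    top, left, bottom, right = box
    return [row[left:right] for row in frame[top:bottom]]


def ruminate(video: List[Frame], question: str,
             policy: Callable[[State], Action],
             max_steps: int = 8) -> Tuple[str, State]:
    """Run the closed loop until the policy answers or the step budget ends."""
    state = State(question)
    for _ in range(max_steps):
        action = policy(state)          # the model decides the next operation
        state.trace.append(action)
        if action.kind == "answer":
            return action.answer, state
        picked = [video[i] for i in action.frame_ids]     # frame selection
        if action.kind == "zoom" and action.box is not None:
            picked = [crop(f, action.box) for f in picked]  # region zoom
        state.evidence.extend(picked)   # stand-in for re-encoding the pixels
    return "", state                    # budget exhausted without an answer
```

In this sketch the policy sees the accumulated evidence at every step, which is what lets a later operation depend on what an earlier one revealed. The recorded `trace` corresponds loosely to the executable rumination trajectories mentioned above.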
