VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, Limin Wang
TL;DR
The paper improves video understanding in multimodal LLMs through Reinforcement Fine-Tuning (RFT) based on Group Relative Policy Optimization (GRPO) with spatio-temporal rewards. It formulates a multi-reward framework (format, IoU, accuracy, recall) and shows how to combine these signals to train VideoChat-R1 on 18k+ samples, achieving state-of-the-art spatio-temporal perception (temporal grounding and object tracking) while preserving general QA and chat capabilities. A key contribution is the Temporal Clue-driven Reasoning paradigm, in which clues produced by the model are used to fetch higher-resolution clips for refinement, improving accuracy in long-video understanding. The results indicate that GRPO-based RFT is data-efficient and robust across tasks, offering a practical path toward reliable video dialogue systems and near-closed-loop video comprehension.
Abstract
Reinforcement Learning (RL) benefits Large Language Models (LLMs) for complex reasoning. Inspired by this, we explore integrating spatio-temporal specific rewards into Multimodal Large Language Models (MLLMs) to address the unique challenges of video understanding, such as long-range temporal associations. This paper investigates how rule-based rewards, particularly temporal ones, can improve video reasoning and their generalizability. Our study proposes Reinforcement Fine-Tuning (RFT) as a data-efficient method to enhance video reasoning on specific tasks without sacrificing original capabilities. Through joint RFT on multiple spatio-temporal perception tasks, we developed VideoChat-R1, a powerful Video MLLM. VideoChat-R1 achieves state-of-the-art spatio-temporal perception, demonstrating significant improvements in tasks like temporal grounding (+31.8) and object tracking (+31.2), while also improving general QA benchmarks. The enhanced perception and preserved chat abilities contribute to a more reliable video dialogue system, leading to our "Temporal Clue-driven Reasoning" inference schema. This work provides a foundation for developing robust, real-world video comprehension agents.
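
To make the idea of rule-based temporal rewards concrete, here is a minimal Python sketch of two reward signals (a temporal IoU reward and a format reward) together with GRPO-style group-relative scoring. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `<think>`/`<answer>` response template, the use of population standard deviation, and the `eps` constant are all hypothetical choices made for this example.

```python
import re
from statistics import mean, pstdev


def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals in seconds; assumes start <= end."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def format_reward(response):
    """1.0 if the response follows the assumed <think>/<answer> template, else 0.0."""
    # The template below is a hypothetical choice for this sketch.
    pattern = r"\s*<think>.*</think>\s*<answer>.*</answer>\s*"
    return 1.0 if re.fullmatch(pattern, response, re.DOTALL) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Score each sampled response relative to its group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std is an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: two sampled responses to one temporal grounding query.
ground_truth = (4.0, 8.0)
samples = [
    ("<think>the action starts late</think><answer>2.0 to 6.0</answer>", (2.0, 6.0)),
    ("around 0 to 1 seconds", (0.0, 1.0)),
]
rewards = [format_reward(text) + temporal_iou(span, ground_truth) for text, span in samples]
advantages = group_relative_advantages(rewards)
```

In this toy group, the first response earns both the format reward and a partial IoU reward, so it receives a positive advantage, while the second receives a negative one. Summing the two rewards with equal weight is a simplification made for the example.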
