VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

Xinhao Li; Ziang Yan; Desen Meng; Lu Dong; Xiangyu Zeng; Yinan He; Yali Wang; Yu Qiao; Yi Wang; Limin Wang

TL;DR

The paper addresses the challenge of enabling video understanding in multimodal LLMs by introducing Reinforcement Fine-Tuning (RFT) guided by Group Relative Policy Optimization (GRPO) with spatio-temporal rewards. It formulates a multi-reward framework (format, IoU, accuracy, recall) and demonstrates how to combine these signals to train VideoChat-R1 on 18k+ samples, achieving state-of-the-art spatio-temporal perception (temporal grounding and object tracking) while preserving general QA and chat capabilities. A key contribution is the Temporal Clue-driven Reasoning paradigm, where model-provided clues are used to fetch higher-resolution clips for refinement, improving accuracy in long-video understanding. The results indicate that GRPO-based RFT is data-efficient and robust across tasks, offering a practical path toward reliable video dialogue systems and near-closed-loop video comprehension.

Abstract

Reinforcement Learning (RL) benefits Large Language Models (LLMs) for complex reasoning. Inspired by this, we explore integrating spatio-temporal specific rewards into Multimodal Large Language Models (MLLMs) to address the unique challenges of video understanding, such as long-range temporal associations. This paper investigates how rule-based rewards, particularly temporal ones, can improve video reasoning, and how well those improvements generalize. Our study proposes Reinforcement Fine-Tuning (RFT) as a data-efficient method to enhance video reasoning on specific tasks without sacrificing original capabilities. Through joint RFT on multiple spatio-temporal perception tasks, we develop VideoChat-R1, a powerful Video MLLM. VideoChat-R1 achieves state-of-the-art spatio-temporal perception, demonstrating significant improvements in tasks like temporal grounding (+31.8) and object tracking (+31.2), while also improving general QA benchmarks. The enhanced perception and preserved chat abilities contribute to a more reliable video dialogue system, leading to our "Temporal Clue-driven Reasoning" inference schema. This work provides a foundation for developing robust, real-world video comprehension agents.
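
To make the idea of a rule-based temporal reward concrete, the following is a minimal sketch rather than the paper's implementation. It assumes that a temporal grounding prediction is a (start, end) interval in seconds with start <= end, that the reward is the interval IoU against the ground truth, and that rewards are normalized within a group of sampled responses in the way GRPO is commonly described. The function names, the use of the population standard deviation, and the `eps` constant are illustrative assumptions.

```python
from statistics import mean, pstdev


def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end), with start <= end."""
    # Overlap length, clamped at zero when the intervals are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union length = sum of both lengths minus the overlap.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    ground_truth = (5.0, 15.0)
    # Hypothetical intervals predicted by four sampled responses.
    predictions = [(0.0, 10.0), (5.0, 15.0), (20.0, 30.0), (6.0, 14.0)]
    rewards = [temporal_iou(p, ground_truth) for p in predictions]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Under these assumptions, responses whose intervals overlap the ground truth more than the group average receive positive advantages and the rest receive negative ones, so no learned reward model is needed for this signal.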
