VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, Limin Wang

TL;DR

The paper addresses the challenge of enabling video understanding in multimodal LLMs by introducing Reinforcement Fine-Tuning (RFT) guided by Group Relative Policy Optimization (GRPO) with spatio-temporal rewards. It formulates a multi-reward framework (format, IoU, accuracy, recall) and demonstrates how to combine these signals to train VideoChat-R1 on 18k+ samples, achieving state-of-the-art spatio-temporal perception (temporal grounding and object tracking) while preserving general QA and chat capabilities. A key contribution is the Temporal Clue-driven Reasoning paradigm, where model-provided clues are used to fetch higher-resolution clips for refinement, improving accuracy in long-video understanding. The results indicate that GRPO-based RFT is data-efficient and robust across tasks, offering a practical path toward reliable video dialogue systems and near-closed-loop video comprehension.
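To make the reward signals concrete, here is a minimal Python sketch of a temporal IoU reward, a format reward, and the group-relative advantage normalization that GRPO applies to a group of sampled responses. It is an illustration under stated assumptions, not the authors' implementation: the `<think>`/`<answer>` template, the unweighted sum of the two rewards, the use of the population standard deviation, and all function names are hypothetical choices made for this example.

```python
import re
from statistics import mean, pstdev


def temporal_iou(pred, gt):
    """IoU of two time intervals, each given as (start, end) in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    if pe < ps or ge < gs:
        return 0.0  # treat malformed intervals as a complete miss
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


# Assumed output template: reasoning in <think>, final answer in <answer>.
_FORMAT = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)


def format_reward(response):
    """1.0 if the response follows the assumed template, else 0.0."""
    return 1.0 if _FORMAT.match(response.strip()) else 0.0


def grounding_reward(response, pred, gt):
    """Assumed combination: unweighted sum of format and IoU rewards."""
    return format_reward(response) + temporal_iou(pred, gt)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    gt = (5.0, 15.0)
    samples = [
        ("<think>early scene</think><answer>0-10</answer>", (0.0, 10.0)),
        ("<think>exact match</think><answer>5-15</answer>", (5.0, 15.0)),
        ("5-15", (5.0, 15.0)),  # correct interval, but template is missing
    ]
    rewards = [grounding_reward(text, pred, gt) for text, pred in samples]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because the advantage is computed relative to the other responses sampled for the same prompt, no separate value model is needed; a response is reinforced only to the extent that it beats its own group.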

Abstract

Reinforcement Learning (RL) benefits Large Language Models (LLMs) for complex reasoning. Inspired by this, we explore integrating spatio-temporal specific rewards into Multimodal Large Language Models (MLLMs) to address the unique challenges of video understanding, such as long-range temporal associations. This paper investigates how rule-based rewards, particularly temporal ones, can improve video reasoning and their generalizability. Our study proposes Reinforcement Fine-Tuning (RFT) as a data-efficient method to enhance video reasoning on specific tasks without sacrificing original capabilities. Through joint RFT on multiple spatio-temporal perception tasks, we developed VideoChat-R1, a powerful Video MLLM. VideoChat-R1 achieves state-of-the-art spatio-temporal perception, demonstrating significant improvements in tasks like temporal grounding (+31.8) and object tracking (+31.2), while also improving general QA benchmarks. The enhanced perception and preserved chat abilities contribute to a more reliable video dialogue system, leading to our "Temporal Clue-driven Reasoning" inference schema. This work provides a foundation for developing robust, real-world video comprehension agents.

Paper Structure

This paper contains 31 sections, 5 equations, 3 figures, 8 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of VideoChat-R1. Through reinforcement fine-tuning with GRPO, VideoChat-R1 gains strong spatio-temporal perception capabilities and can apply them in chat scenarios.
  • Figure 2: Examples of the temporal grounding task. VideoChat-R1 gives a more accurate time interval after thinking.
  • Figure 3: Examples of the video QA task. VideoChat-R1 not only answers questions correctly but also provides relatively accurate reference time periods (clues).