Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency
Hongyu Li, Songhao Han, Yue Liao, Junfeng Luo, Jialin Gao, Shuicheng Yan, Si Liu
TL;DR
This paper introduces Temporal-RLT, a GRPO-based post-training framework to boost video-specific reasoning in multimodal large language models. It employs a dual reward design (discrete semantic signals from multiple-choice VideoQA and continuous temporal signals from temporal IoU grounding) together with variance-aware data selection to improve learning efficiency. The authors construct the Temporal-RLT-Full-490k and Temporal-RLT-32k datasets and demonstrate strong, data-efficient gains across eight video understanding benchmarks covering VideoQA, grounding, and reasoning tasks, outperforming supervised fine-tuning and prior RLT baselines. The work highlights reward design and sample selection as key levers for scalable, reasoning-centric video understanding with VideoLLMs, and notes limitations and future directions for richer thinking supervision.
Abstract
Understanding real-world videos with complex semantics and long temporal dependencies remains a fundamental challenge in computer vision. Recent progress in multimodal large language models (MLLMs) has demonstrated strong capabilities in vision-language tasks, while reinforcement learning tuning (RLT) has further improved their reasoning abilities. In this work, we explore RLT as a post-training strategy to enhance the video-specific reasoning capabilities of MLLMs. Built upon the Group Relative Policy Optimization (GRPO) framework, we propose a dual-reward formulation that supervises both semantic and temporal reasoning through discrete and continuous reward signals. To facilitate effective preference-based optimization, we introduce a variance-aware data selection strategy based on repeated inference to identify samples that provide informative learning signals. We evaluate our approach across eight representative video understanding tasks, including VideoQA, Temporal Video Grounding, and Grounded VideoQA. Our method consistently outperforms supervised fine-tuning and existing RLT baselines, achieving superior performance with significantly less training data. These results underscore the importance of reward design and data selection in advancing reasoning-centric video understanding with MLLMs. Notably, the initial code release (two months ago) has since been expanded with updates, including optimized reward mechanisms and additional datasets. The latest version is available at https://github.com/appletea233/Temporal-R1.
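As a rough illustration of the ideas summarized above, the following Python sketch shows one way a discrete multiple-choice reward, a continuous temporal IoU reward, GRPO-style group-relative advantages, and a variance-based sample filter could fit together. It is a minimal sketch under stated assumptions, not the released implementation: all function names, the `min_var` threshold, and the simple "keep samples with non-trivial reward variance" rule are hypothetical choices made for this example.

```python
from statistics import pvariance


def temporal_iou(pred, gt):
    """Continuous reward: IoU of two (start, end) intervals in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    if pe <= ps or ge <= gs:  # degenerate interval, no reward
        return 0.0
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union


def choice_reward(pred, answer):
    """Discrete reward: 1.0 if the predicted option letter matches, else 0.0."""
    return 1.0 if pred.strip().upper() == answer.strip().upper() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward within its rollout group."""
    mean = sum(rewards) / len(rewards)
    std = pvariance(rewards) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


def select_informative(rewards_by_sample, min_var=0.05):
    """Assumed selection rule: keep samples whose rewards vary across
    repeated inference runs (min_var is a hypothetical threshold)."""
    return [sid for sid, rs in rewards_by_sample.items()
            if pvariance(rs) > min_var]


# Repeated inference on two hypothetical samples:
# "q1" is always answered correctly, "q2" only half of the time.
rollouts = {"q1": [1.0, 1.0, 1.0, 1.0], "q2": [1.0, 0.0, 1.0, 0.0]}
kept = select_informative(rollouts)
```

The filter is motivated by the group normalization itself: when every rollout for a sample earns the same reward, all group-relative advantages are zero and the sample contributes no gradient signal, so only samples with reward variance are informative under this sketch.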
