TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs

Yunheng Li, Jing Cheng, Shaoyong Jia, Hangyi Kuang, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng

TL;DR

TempSamp-R1 addresses the inefficiencies of on-policy GRPO in long-span video temporal grounding by integrating high-quality off-policy supervision with mixed-policy sampling. It introduces a non-linear soft advantage to stabilize reward-based updates and a hybrid Chain-of-Thought training regime that supports both CoT and non-CoT inference in a single model. Empirical results on Charades-STA, ActivityNet Captions, and QVHighlights show state-of-the-art performance and strong few-shot generalization, outperforming SFT and GRPO baselines. The approach offers a data-efficient, stable path for scaling temporal grounding in multimodal video-language models.

Abstract

This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks. We reveal that existing reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), rely on on-policy sampling for policy updates. However, in tasks with large temporal search spaces, this strategy becomes both inefficient and limited in performance, as it often fails to identify temporally accurate solutions. To address this limitation, TempSamp-R1 leverages ground-truth annotations as off-policy supervision to provide temporally precise guidance, effectively compensating for the sparsity and misalignment in on-policy solutions. To further stabilize training and reduce variance in reward-based updates, TempSamp-R1 provides a non-linear soft advantage computation method that dynamically reshapes the reward feedback via an asymmetric transformation. By employing a hybrid Chain-of-Thought (CoT) training paradigm, TempSamp-R1 optimizes a single unified model to support both CoT and non-CoT inference modes, enabling efficient handling of queries with varying reasoning complexity. Experimental results demonstrate that TempSamp-R1 outperforms GRPO-based baselines, establishing new state-of-the-art performance on benchmark datasets: Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions (R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%). Moreover, TempSamp-R1 shows robust few-shot generalization capabilities under limited data. Code: https://github.com/HVision-NKU/TempSamp-R1
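The interaction between off-policy supervision and group-relative advantages described above can be illustrated with a small sketch. It assumes the ground-truth annotation enters the sampled group as one extra solution with reward 1.0, and that advantages are normalized GRPO-style within the group. The function `soft_shape`, its threshold `tau`, and its scale `alpha` are hypothetical stand-ins for an asymmetric non-linear transformation. They are not the paper's formula, which this summary does not give.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style group normalization: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def soft_shape(reward, tau=0.8, alpha=0.5):
    """Hypothetical asymmetric shaping (an assumption, not the paper's rule).

    Rewards at or below tau pass through unchanged; rewards above tau are
    log-compressed so one near-perfect solution cannot dominate the group.
    """
    if reward <= tau:
        return reward
    return tau + alpha * math.log1p(reward - tau)


def mixed_policy_advantages(on_policy_rewards, off_policy_reward=1.0):
    """Append the off-policy (ground-truth) reward to the on-policy group,
    shape every reward, then normalize within the combined group."""
    combined = list(on_policy_rewards) + [off_policy_reward]
    shaped = [soft_shape(r) for r in combined]
    return group_advantages(shaped)


if __name__ == "__main__":
    # Four on-policy samples with low IoU rewards plus one ground-truth solution.
    advantages = mixed_policy_advantages([0.1, 0.2, 0.0, 0.3])
    print([round(a, 3) for a in advantages])
```

In this toy group the off-policy solution receives the largest advantage. The compression above `tau` narrows its gap to the on-policy samples, which is one plausible reading of how reshaping could reduce variance in the updates.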

Paper Structure

This paper contains 17 sections, 4 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: TempSamp-R1 integrates high-quality off-policy solutions with on-policy sampling, combined with soft advantage estimation to enable stable policy updates. It outperforms GRPO, which relies solely on on-policy sampling, on both Charades-STA and ActivityNet Captions.
  • Figure 2: Overview of the TempSamp-R1 framework used to fine-tune the multimodal policy model. Given a few training examples, both the policy model and the off-policy guidance are used to generate solutions. Rewards are computed for each solution, and a soft advantage estimation module transforms raw rewards into standardized advantages for stable policy optimization. Right: Comparison of normalized advantages from GRPO (top) and our method (bottom), illustrating improved advantage discrimination. For clarity, the reference model and KL penalty are omitted.
  • Figure 3: Skewness of the advantage distributions during training for different variants.
  • Figure 4: Ablation results comparing GRPO with enhanced variants incorporating mixed-policy rewards and alternative advantage shaping strategies.
  • Figure 5: Distribution of top-1 IoU rewards under GRPO and TempSamp-R1 on Charades-STA and ActivityNet Captions. TempSamp-R1 exhibits higher median rewards and reduced variance, indicating more stable and effective policy learning.
  • ...and 4 more figures
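
The IoU rewards in the captions above and the R1@0.7 and R1@0.5 figures in the abstract both rest on temporal intersection-over-union between a predicted segment and an annotated one. The sketch below shows the standard definition, assuming segments are `(start, end)` pairs in seconds and one top-1 prediction per query. The function names are illustrative and are not taken from the released code.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two time segments given as (start, end)."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(predictions, ground_truths, threshold):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(
        temporal_iou(pred, gt) >= threshold
        for pred, gt in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)


if __name__ == "__main__":
    predictions = [(0.0, 10.0), (2.0, 8.0)]
    ground_truths = [(0.0, 8.0), (2.0, 8.0)]
    print(recall_at_1(predictions, ground_truths, threshold=0.7))
```

With this definition, R1@0.7 counts a prediction as correct only if its IoU with the annotation is at least 0.7, so it is a stricter criterion than R1@0.5.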