TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding

Canhui Tang, Zifan Han, Hongbo Sun, Sanping Zhou, Xuchong Zhang, Xin Wei, Ye Yuan, Huayu Zhang, Jinglin Xu, Hao Sun

TL;DR

TSPO tackles long-form video language understanding by learning a trainable sparse frame sampling policy through reinforcement learning. It introduces an event-aware temporal agent that scores candidate frames, and uses GRPO-based optimization that treats keyframe selection and language generation as a joint decision-making process; only the sampling policy is updated, while the LLM is kept frozen for stability. Two training data pipelines, Comprehensive Temporal Data and Video Needle-in-a-Haystack, supply targeted rewards for answering accuracy and temporal localization. The method reports state-of-the-art results on multiple long video understanding benchmarks and transfers to different Video-MLLMs, offering a practical framework for long-form video QA under the sparse frame budget that an MLLM's context limit imposes.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision-language tasks, yet they still face challenges when processing long-duration video inputs. The limitation arises from MLLMs' context limit and training costs, necessitating sparse frame sampling before feeding videos into MLLMs. However, building a trainable sampling method remains challenging due to the unsupervised and non-differentiable nature of sparse frame sampling in Video-MLLMs. To address these problems, we propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs' long-form video-language understanding via reinforcement learning. Specifically, we first propose a trainable event-aware temporal agent, which captures event-query correlation for performing probabilistic keyframe selection. Then, we propose the TSPO reinforcement learning paradigm, which models keyframe selection and language generation as a joint decision-making process, enabling end-to-end group relative optimization for the temporal sampling policy. Furthermore, we propose a dual-style long video training data construction pipeline, balancing comprehensive temporal understanding and key segment localization. Finally, we incorporate rule-based answering accuracy and temporal locating reward mechanisms to optimize the temporal sampling policy. Comprehensive experiments show that our TSPO achieves state-of-the-art performance across multiple long video understanding benchmarks, and shows transferable ability across different cutting-edge Video-MLLMs. Our code is available at https://github.com/Hui-design/TSPO
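The "group relative optimization" mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard GRPO-style normalization, in which each rollout's reward is standardized against the other rollouts in its group. The names `group_relative_advantages` and `total_reward` are hypothetical, and combining the answering accuracy reward $R_A$ and the temporal localization reward $R_T$ as an unweighted sum of 0/1 values is an assumption made here for illustration, not something this summary specifies.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of G rollouts (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


def total_reward(answer_correct, frames_hit_segment):
    """Hypothetical rule-based reward: R_A + R_T, each 0 or 1.

    The unweighted sum is an assumption made for illustration only.
    """
    r_a = 1.0 if answer_correct else 0.0
    r_t = 1.0 if frames_hit_segment else 0.0
    return r_a + r_t


# One group of G = 4 sampled keyframe combinations for the same video and query.
rollouts = [(True, True), (False, True), (False, False), (True, False)]
rewards = [total_reward(a, t) for a, t in rollouts]
advantages = group_relative_advantages(rewards)
```

Rollouts that score above the group mean receive a positive advantage and those below receive a negative one, so a policy update would favor the keyframe combinations that led to better outcomes without requiring a learned value function.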

Paper Structure

This paper contains 14 sections, 11 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Illustrations of different frame sampling methods: Training-free uniform sampling (a) and keyframe search (b) select unsatisfactory frames, while our method (c) explores and optimizes the temporal sampling policy that leads to the correct answer in an end-to-end training manner.
  • Figure 2: Overview of our TSPO framework. The training pipeline takes long videos as input. It first employs a temporal agent to sample $G$ keyframe combinations (only one during inference), then optimizes the sampling policy through our temporal sampling policy optimization algorithm with the temporal localization reward $R_T$ and the answering accuracy reward $R_A$.
  • Figure 3: Comparison between our TSPO and previous Video-MLLM optimization methods. We model keyframe selection and language generation as a joint decision-making process for end-to-end optimization of the temporal agent.
  • Figure 4: Our proposed TSPO-targeted long video training data construction pipeline.
  • Figure 5: Visual comparison of sampled frames and corresponding responses between our TSPO and LLaVA-Video.
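The Figure 2 caption describes a temporal agent that samples $G$ keyframe combinations during training. A minimal sketch of one way to draw such combinations from per-frame scores follows. It uses Gumbel-top-k sampling as a stand-in technique, since the paper's exact sampling scheme is not given in this section; the function names, the example scores, and the values of `k` and `num_groups` are all hypothetical.

```python
import math
import random


def sample_keyframes(scores, k, rng):
    """Draw k distinct frame indices, favoring high scores (Gumbel-top-k)."""
    perturbed = []
    for idx, score in enumerate(scores):
        # Gumbel(0, 1) noise; clamp the uniform draw away from 0 to avoid log(0).
        u = max(rng.random(), 1e-12)
        noise = -math.log(-math.log(u))
        perturbed.append((score + noise, idx))
    top = sorted(perturbed, reverse=True)[:k]
    # Return indices in temporal order, as a video model would consume them.
    return sorted(idx for _, idx in top)


def sample_groups(scores, k, num_groups, seed=0):
    """Sample num_groups keyframe combinations from the same frame scores."""
    rng = random.Random(seed)
    return [sample_keyframes(scores, k, rng) for _ in range(num_groups)]


# Hypothetical agent scores (logits) for 8 candidate frames.
frame_scores = [0.1, 2.0, 0.3, 1.5, -0.5, 0.0, 2.2, 0.4]
groups = sample_groups(frame_scores, k=3, num_groups=4, seed=0)
```

Because the noise is redrawn for every combination, the $G$ samples differ from one another, which is what lets a group-relative objective compare alternative frame selections for the same video and query.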