TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding

Canhui Tang; Zifan Han; Hongbo Sun; Sanping Zhou; Xuchong Zhang; Xin Wei; Ye Yuan; Huayu Zhang; Jinglin Xu; Hao Sun

TL;DR

TSPO addresses long-form video language understanding by learning a trainable sparse frame sampling policy with reinforcement learning. An event-aware temporal agent scores candidate frames against the query, and GRPO-based optimization treats keyframe selection and language generation as a joint decision process, while the LLM is kept frozen for stability. Two training data pipelines, Comprehensive Temporal Data and Video Needle-in-a-Haystack, supply targeted rewards for answering accuracy and temporal localization. TSPO reports state-of-the-art results on multiple long video understanding benchmarks and transfers across different Video-MLLMs. By feeding the MLLM a sparse set of selected frames, the approach aims to reduce computation while improving accuracy on long videos, giving a practical framework for scalable long-form video QA.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision-language tasks, yet they still face challenges when processing long-duration video inputs. The limitation arises from MLLMs' context limit and training costs, necessitating sparse frame sampling before feeding videos into MLLMs. However, building a trainable sampling method remains challenging due to the unsupervised and non-differentiable nature of sparse frame sampling in Video-MLLMs. To address these problems, we propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs' long-form video-language understanding via reinforcement learning. Specifically, we first propose a trainable event-aware temporal agent, which captures event-query correlation for performing probabilistic keyframe selection. Then, we propose the TSPO reinforcement learning paradigm, which models keyframe selection and language generation as a joint decision-making process, enabling end-to-end group relative optimization for the temporal sampling policy. Furthermore, we propose a dual-style long video training data construction pipeline, balancing comprehensive temporal understanding and key segment localization. Finally, we incorporate rule-based answering accuracy and temporal locating reward mechanisms to optimize the temporal sampling policy. Comprehensive experiments show that our TSPO achieves state-of-the-art performance across multiple long video understanding benchmarks, and shows transferable ability across different cutting-edge Video-MLLMs. Our code is available at https://github.com/Hui-design/TSPO
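
To make the pipeline described in the abstract more concrete, the following is a minimal sketch of two of its ingredients: probabilistic keyframe selection from per-frame scores, and group-relative advantages computed over a group of sampled frame sets. It is an illustration under stated assumptions, not the authors' implementation. The function names, the Gumbel-top-k sampler, and the toy locating reward are all hypothetical choices made for this example; the actual code is in the repository linked above.

```python
import math
import random


def sample_keyframes(scores, k, rng):
    """Sample k distinct frame indices, favoring high-scoring frames.

    Assumption: sampling without replacement from softmax(scores) via the
    Gumbel-top-k trick. The paper's sampler may differ.
    """
    perturbed = []
    for idx, score in enumerate(scores):
        # Clamp u away from 0 and 1 so both logarithms are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        perturbed.append((score + gumbel, idx))
    top_k = sorted(perturbed, reverse=True)[:k]
    # Return indices in temporal order, as a video model would consume them.
    return sorted(idx for _, idx in top_k)


def toy_locating_reward(frames, segment):
    """Hypothetical stand-in for a temporal locating reward:
    the fraction of sampled frames inside the ground-truth segment
    (both endpoints inclusive)."""
    start, end = segment
    hits = sum(1 for f in frames if start <= f <= end)
    return hits / len(frames)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts (GRPO-style):
    subtract the group mean and divide by the group standard deviation."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical per-frame scores from a temporal agent (higher = more relevant).
    scores = [0.1, 0.2, 2.5, 3.0, 2.8, 0.0, -0.5, 0.3]
    segment = (2, 4)  # assumed ground-truth span of the key event

    # One group = several independently sampled frame sets for the same query.
    group = [sample_keyframes(scores, k=3, rng=rng) for _ in range(4)]
    rewards = [toy_locating_reward(frames, segment) for frames in group]
    advantages = group_relative_advantages(rewards)

    for frames, reward, adv in zip(group, rewards, advantages):
        print(frames, round(reward, 3), round(adv, 3))
```

In a full training loop, each advantage would weight the log-probability of the corresponding frame selection when updating the sampling policy, and the reward would also include an answering-accuracy term obtained from the Video-MLLM's output. Both are omitted here to keep the sketch self-contained.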
