VideoSSR: Video Self-Supervised Reinforcement Learning

Zefeng He, Xiaoye Qu, Yafu Li, Siyuan Huang, Daizong Liu, Yu Cheng

TL;DR

The paper tackles the data annotation bottleneck in video RLVR for multimodal large language models (MLLMs) by exploiting intrinsic video signals to generate verifiable training data. It introduces three self-supervised pretext tasks (Anomaly Grounding, Object Counting, and Temporal Jigsaw) and the Video Intrinsic Understanding Benchmark (VIUBench) to probe intrinsic video comprehension. It then builds the VideoSSR framework and the VideoSSR-30K dataset, pairs them with smooth reward functions for stable RLVR training via GRPO, and reports consistent gains across 17 benchmarks (over 5% on average) spanning General Video QA, Long Video QA, Temporal Grounding, and Complex Reasoning. Together, these contributions offer a scalable, low-cost way to improve video understanding in MLLMs.
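To illustrate what a "smooth" reward can look like for a grounding-style task, here is a minimal sketch that compares temporal intersection-over-union (IoU) with an exact-match reward. Using IoU here is an assumption made for illustration; this summary does not state the paper's reward definitions, and the function names `temporal_iou` and `exact_match_reward` are hypothetical.

```python
def temporal_iou(pred, target):
    """Intersection-over-union of two (start, end) intervals in seconds.

    Illustrative smooth reward (an assumption, not the paper's formula):
    partial overlap earns partial credit.
    """
    p0, p1 = pred
    t0, t1 = target
    if p1 <= p0 or t1 <= t0:
        # Degenerate or reversed intervals get no reward.
        return 0.0
    inter = max(0.0, min(p1, t1) - max(p0, t0))
    union = (p1 - p0) + (t1 - t0) - inter
    return inter / union


def exact_match_reward(pred, target):
    """Binary reward: 1.0 only if both timestamps match exactly."""
    return 1.0 if tuple(pred) == tuple(target) else 0.0


if __name__ == "__main__":
    target = (4.0, 8.0)
    for pred in [(4.0, 8.0), (2.0, 6.0), (0.0, 1.0)]:
        print(pred, exact_match_reward(pred, target), round(temporal_iou(pred, target), 3))
```

A near-miss prediction scores zero under exact match but receives partial credit under IoU, so rollouts for the same question are more likely to receive distinct rewards.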

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has substantially advanced the video understanding capabilities of Multimodal Large Language Models (MLLMs). However, the rapid progress of MLLMs is outpacing the complexity of existing video datasets, while the manual annotation of new, high-quality data remains prohibitively expensive. This work investigates a pivotal question: Can the rich, intrinsic information within videos be harnessed to self-generate high-quality, verifiable training data? To investigate this, we introduce three self-supervised pretext tasks: Anomaly Grounding, Object Counting, and Temporal Jigsaw. We construct the Video Intrinsic Understanding Benchmark (VIUBench) to validate their difficulty, revealing that current state-of-the-art MLLMs struggle significantly on these tasks. Building upon these pretext tasks, we develop the VideoSSR-30K dataset and propose VideoSSR, a novel video self-supervised reinforcement learning framework for RLVR. Extensive experiments across 17 benchmarks, spanning four major video domains (General Video QA, Long Video QA, Temporal Grounding, and Complex Reasoning), demonstrate that VideoSSR consistently enhances model performance, yielding an average improvement of over 5%. These results establish VideoSSR as a potent foundational framework for developing more advanced video understanding in MLLMs. The code is available at https://github.com/lcqysl/VideoSSR.
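The idea of self-generating verifiable data can be made concrete with a small sketch of a Temporal Jigsaw sample: split a video into contiguous clips, shuffle them, and keep the permutation as the ground-truth answer. This works on frame indices only and is a hypothetical illustration; the function name `make_temporal_jigsaw`, the equal-length split, and the answer format are assumptions, not the construction used for VideoSSR-30K.

```python
import random


def make_temporal_jigsaw(num_frames, num_clips, seed=None):
    """Build one jigsaw sample over frame indices 0..num_frames-1.

    Returns (shuffled_frames, order), where order[i] is the original
    index of the clip shown at position i. No human label is needed:
    the answer is known because we produced the shuffle ourselves.
    """
    if not 1 <= num_clips <= num_frames:
        raise ValueError("need 1 <= num_clips <= num_frames")
    rng = random.Random(seed)

    # Split the frame indices into num_clips contiguous, near-equal clips.
    bounds = [round(i * num_frames / num_clips) for i in range(num_clips + 1)]
    clips = [list(range(bounds[i], bounds[i + 1])) for i in range(num_clips)]

    # Shuffle until the order differs from the original (if possible).
    order = list(range(num_clips))
    while num_clips > 1 and order == sorted(order):
        rng.shuffle(order)

    shuffled_frames = [frame for clip_idx in order for frame in clips[clip_idx]]
    return shuffled_frames, order


if __name__ == "__main__":
    frames, order = make_temporal_jigsaw(num_frames=12, num_clips=4, seed=0)
    print("clip order shown to the model:", order)
    print("frame indices after shuffling:", frames)
```

A model's predicted ordering can then be checked against `order` programmatically, which is what makes the task verifiable in the RLVR sense.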

Paper Structure

This paper contains 31 sections, 8 equations, 19 figures, 8 tables.

Figures (19)

  • Figure 1: Distribution of answer correctness on ReWatch and LongVideoReason. Across both models and datasets, the vast majority of questions yield a bimodal outcome: either zero or eight correct answers. This zero-variance issue is notably more pronounced for the more powerful Qwen3-VL model.
  • Figure 2: Performance comparison on four video tasks. Input frames for VideoSSR and Qwen3-VL-8B do not exceed 64.
  • Figure 3: An overview of our three self-supervised pretext tasks. (a) Anomaly Grounding: A temporal segment is perturbed (e.g., via rotation), and the task is to identify the start and end timestamps of this anomaly. (b) Object Counting: Procedurally generated shapes are overlaid onto selected frames, and the task is to count the total number of each shape type. (c) Temporal Jigsaw: The video is divided into clips which are then shuffled. The task is to predict the original temporal order of the segments.
  • Figure 4: Task distribution in VIUBench and VideoSSR-30K. The left panel illustrates the proportional data distribution across our three pretext tasks and their subtypes for VIUBench. The right panel shows the corresponding composition of VideoSSR-30K.
  • Figure 5: Comparison of single-task and mixed-task training at the 30k data scale. The results demonstrate that task diversity is more effective for improving performance than simply scaling up the data for a single pretext task.
  • ...and 14 more figures
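
The zero-variance issue described in the Figure 1 caption matters for GRPO because advantages are computed relative to the other rollouts for the same question. The sketch below uses the commonly described group normalization (reward minus group mean, divided by group standard deviation) and assumes eight rollouts per question, as the caption implies. The exact variant used in the paper is not stated here, and `group_advantages` is a hypothetical helper name.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages in the style commonly described for GRPO.

    Assumption: advantage_i = (r_i - mean(r)) / (std(r) + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    all_correct = [1.0] * 8          # eight of eight rollouts correct
    all_wrong = [0.0] * 8            # zero of eight rollouts correct
    mixed = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

    print(group_advantages(all_correct))  # every advantage is 0.0
    print(group_advantages(all_wrong))    # every advantage is 0.0
    print(group_advantages(mixed))        # nonzero values
```

When every rollout in a group receives the same reward, all advantages are zero and that question contributes no learning signal under this formulation.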