ViSS-R1: Self-Supervised Reinforcement Video Reasoning

Bo Fang, Yuxin Song, Qiangqiang Wu, Haoyuan Sun, Wenhao Wu, Antoni B. Chan

TL;DR

The paper addresses the underuse of visual information in video reasoning by MLLMs by adding self-supervised objectives to reinforcement learning. It makes two contributions: Pretext-GRPO, a warm-start RL method that learns from transformation-based pretext tasks, and ViSS-R1, a single-stage framework that integrates self-supervised learning directly into the R1 pipeline through transform-aware prompts and multi-component rewards. Across six benchmarks, both methods improve on baseline R1 approaches, with state-of-the-art results on four tasks. The approach relies on rich visual information and structured reasoning to reduce hallucinations and improve generalization in video understanding. The work positions SSL-driven RL as a practical path to end-to-end, visual-centric video reasoning in multimodal systems.

Abstract

Complex video reasoning remains a significant challenge for Multimodal Large Language Models (MLLMs), as current R1-based methodologies often prioritize text-centric reasoning derived from text-based and image-based developments. In video tasks, such strategies frequently underutilize rich visual information, leading to potential shortcut learning and increased susceptibility to hallucination. To foster a more robust, visual-centric video understanding, we start by introducing a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline, in which positive rewards are assigned for correctly solving pretext tasks on transformed visual inputs, which forces the model to process the visual information non-trivially. Building on the effectiveness of Pretext-GRPO, we further propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM's R1 post-training paradigm. Instead of relying solely on sparse visual cues, our framework compels models to reason about transformed visual input by simultaneously processing both pretext questions (concerning transformations) and true user queries. This necessitates identifying the applied transformation and reconstructing the original video to formulate accurate final answers. Comprehensive evaluations on six widely used video reasoning and understanding benchmarks demonstrate the effectiveness and superiority of our Pretext-GRPO and ViSS-R1 for complex video reasoning. Our code and models will be publicly available.
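
The pretext-reward idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a video is a plain list of frames, that the pretext task is a multiple-choice question about which transformation was applied, and that the reward is binary. The transformation set (`none`, `reverse`), the question wording, and all function names are hypothetical. The group-relative normalization is the standard GRPO advantage (reward minus group mean, divided by group standard deviation).

```python
import random
import statistics

# Hypothetical transformation set. Only "reverse" is named in the source;
# "none" is added here so the question has two options.
TRANSFORMS = {
    "none": lambda frames: list(frames),
    "reverse": lambda frames: list(reversed(frames)),
}


def make_pretext_sample(frames, rng):
    """Apply a random transformation and build a multiple-choice pretext question."""
    names = sorted(TRANSFORMS)  # fixed option order: ["none", "reverse"]
    applied = rng.choice(names)
    transformed = TRANSFORMS[applied](frames)
    letters = [chr(ord("A") + i) for i in range(len(names))]
    options = "\n".join(f"{letter}. {name}" for letter, name in zip(letters, names))
    question = "Which transformation was applied to the video?\n" + options
    correct_letter = letters[names.index(applied)]
    return transformed, question, correct_letter


def pretext_reward(predicted_letter, correct_letter):
    """Binary reward: 1.0 if the pretext question is answered correctly, else 0.0."""
    return 1.0 if predicted_letter.strip().upper() == correct_letter else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward against its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    frames = list(range(8))  # stand-in for 8 video frames
    transformed, question, correct = make_pretext_sample(frames, rng)
    sampled_answers = ["A", "B", "B", "A"]  # hypothetical group of policy outputs
    rewards = [pretext_reward(a, correct) for a in sampled_answers]
    print(question)
    print(rewards, group_relative_advantages(rewards))
```

Because the label comes from the transformation itself, no human annotation is needed. For a group with rewards `[1, 0, 0, 1]` the advantages are roughly `+1` and `-1`, and a group whose answers all receive the same reward yields zero advantage.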

Paper Structure

This paper contains 14 sections, 3 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Overview. (a) Standard R1 paradigm for MLLM video reasoning consists of an SFT memorization stage followed by RL exploration. (b) We introduce an intermediate Pretext-GRPO stage for visual-centric RL reasoning by learning from self-supervised visual transformations Tr(V). (c) Our ViSS-R1 framework fully integrates pretext-task reasoning into the R1 paradigm, where the model takes transformed visual inputs Tr(V) and is designed to simultaneously output the inferred transformation (Tr) and the query's answer (A).
  • Figure 2: (a) Example of a "Reverse" MCQ pretext question used in our Pretext-GRPO, where randomly transformed visual inputs are leveraged to construct targeted pretext queries for policy model prompting. (b) Pretext-GRPO+ denotes a Pretext-GRPO stage followed by vanilla GRPO, which consistently improves performance across multiple video benchmarks. All results are based on 16-frame evaluation.
  • Figure 3: ViSS-R1 framework. Mixed images and videos are randomly augmented with SSL transformations for both SFT and RL reasoning. The models are required to simultaneously address pretext questions regarding the applied transformations and answer real user queries. ViSS-R1 additionally learns a <transform> tag in SFT to encapsulate pretext identification results, which provides structural organization for answers and facilitates reward manipulation during RL exploration.
  • Figure 4: Impact of training reward $R_t$.
  • Figure 5: Qualitative results. For ViSS-R1, we remove the pretext question in prompts and use untransformed videos for inference.
  • ...and 1 more figure
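
The Figure 3 caption says the `<transform>` tag gives answers a structure that rewards can act on. The Python sketch below shows one way a multi-component reward could be computed from such a response. It is an illustration under assumptions, not the paper's reward: the `<answer>` tag, the exact-match scoring, the three components, and the weights are all hypothetical. Only the `<transform>` tag comes from the source.

```python
import re


def extract_tag(text, tag):
    """Return the stripped content of the first <tag>...</tag> pair, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def combined_reward(response, true_transform, true_answer,
                    w_format=0.5, w_transform=1.0, w_answer=1.0):
    """Weighted sum of format, transformation, and answer rewards.

    The weights and the three-way split are assumptions made for illustration.
    """
    pred_transform = extract_tag(response, "transform")
    pred_answer = extract_tag(response, "answer")

    # Format reward: both tags must be present.
    r_format = 1.0 if pred_transform is not None and pred_answer is not None else 0.0
    # Transformation reward: the pretext identification must match the applied one.
    r_transform = 1.0 if (pred_transform is not None
                          and pred_transform.lower() == true_transform.lower()) else 0.0
    # Answer reward: the user query must be answered correctly.
    r_answer = 1.0 if (pred_answer is not None
                       and pred_answer.lower() == true_answer.lower()) else 0.0

    return w_format * r_format + w_transform * r_transform + w_answer * r_answer


if __name__ == "__main__":
    response = ("<think>The motion runs backwards.</think>"
                "<transform>reverse</transform><answer>B</answer>")
    print(combined_reward(response, "reverse", "B"))
```

Keeping the pretext result in its own tag lets the transformation term be scored, reweighted, or dropped without touching the answer term.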