DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO

Jinyoung Park, Jeehye Na, Jinyoung Kim, Hyunwoo J. Kim

TL;DR

DeepVideo-R1 tackles two core problems in GRPO-based fine-tuning of VideoLLMs: reliance on stabilizing safeguards and vanishing advantages. It introduces Reg-GRPO, a regression-based objective that directly predicts group-normalized advantages, removing clipping/min operations, coupled with difficulty-aware data augmentation to diversify rewards across sample difficulties. Empirical results across SEED-Bench-R1, LongVideoBench, and NextGQA show substantial improvements over GRPO and other RL methods, with strong generalization to in-distribution and out-of-distribution tasks. The work demonstrates that aligning training signals via advantage regression and adaptive data augmentation yields more robust, scalable training for multimodal video reasoning models.

Abstract

Recent works have demonstrated the effectiveness of reinforcement learning (RL)-based post-training for enhancing the reasoning capabilities of large language models (LLMs). In particular, Group Relative Policy Optimization (GRPO) has shown impressive success using a PPO-style reinforcement algorithm with group-normalized rewards. However, the effectiveness of GRPO in Video Large Language Models (VideoLLMs) remains less studied. In this paper, we explore GRPO and identify two problems that hinder effective learning: (1) reliance on safeguards, and (2) vanishing advantage. To mitigate these challenges, we propose DeepVideo-R1, a video large language model trained with Reg-GRPO (Regressive GRPO) and difficulty-aware data augmentation. Reg-GRPO reformulates the GRPO loss function into a regression task that directly predicts the advantage in GRPO, eliminating the need for safeguards such as the clipping and min functions. It directly aligns the model with advantages, providing guidance to prefer better responses. The difficulty-aware data augmentation strategy augments input prompts/videos to bring sample difficulty to solvable levels, enabling diverse reward signals. Our experimental results show that our approach significantly improves video reasoning performance across multiple benchmarks.
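
To make the "vanishing advantage" problem concrete, the following is a minimal sketch of group-normalized advantages as commonly used in GRPO: each response's reward is standardized against the other responses sampled for the same prompt. The function names, the `eps` constant, and the use of the population standard deviation are assumptions for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses.

    Assumption: population standard deviation plus a small eps for
    numerical safety; implementations differ on both choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_vanishing_advantage(rewards, tol=1e-8):
    """Return True when all responses in the group share the same reward.

    In that case every normalized advantage is zero, so the group
    contributes no learning signal.
    """
    return max(rewards) - min(rewards) <= tol


# A group with mixed outcomes yields non-zero advantages.
mixed = group_normalized_advantages([1.0, 0.0, 1.0, 0.0])

# A group where every response is correct (a too-easy sample) yields zeros.
all_correct = group_normalized_advantages([1.0, 1.0, 1.0, 1.0])
```

Groups drawn from samples that are too easy or too hard tend to produce identical rewards, which is the situation the difficulty-aware augmentation is meant to reduce.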

Paper Structure

This paper contains 48 sections, 24 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: DeepVideo-R1 significantly improves the reasoning capabilities of VideoLLMs. Our VideoLLM, DeepVideo-R1, is trained to explicitly predict the advantage $\hat{A}^{(i)}$ through the Regressive GRPO loss. Notably, model training becomes markedly more effective, achieving a 10.1 performance improvement compared to GRPO.
  • Figure 2: Overview of the difficulty-aware data augmentation. First, we assess the difficulty of responses given the input video and question using Eq. \ref{eq:difficulty}. For hard samples, the input prompts are augmented with reasoning cues extracted from successful reasoning paths (difficulty-decreasing augmentation), while easy samples are perturbed with noise (difficulty-increasing augmentation). The guidance level or noise level is adaptively determined based on the difficulty of the current sample.
  • Figure 3: Vanishing advantage ratio comparison on GRPO and GRPO+DA-Aug (Difficulty-aware augmentation) (Left). Reward curves of DeepVideo-R1 (Ours) and GRPO (Right).
  • Figure 4: Qualitative results of DeepVideo-R1-7B compared with Qwen2.5-VL-7B+GRPO.
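
The augmentation policy described in the Figure 2 caption can be sketched roughly as follows. This is an illustrative sketch only: the paper's difficulty equation is not reproduced on this page, so the difficulty measure here (the gap between a reference reward and the group's mean reward), the function name `choose_augmentation`, the action labels, and `max_level` are all hypothetical.

```python
def choose_augmentation(group_rewards, reference_reward, max_level=1.0):
    """Pick an augmentation direction and strength for one sample.

    Hypothetical rule: a sample whose mean reward falls below the
    reference is treated as hard and receives reasoning cues; a sample
    above the reference is treated as easy and receives noise. The
    strength grows with the size of the gap, capped at max_level.
    """
    avg = sum(group_rewards) / len(group_rewards)
    gap = reference_reward - avg  # > 0 means harder than the reference
    level = min(abs(gap), max_level)
    if gap > 0:
        return ("add_reasoning_cues", level)  # difficulty-decreasing
    if gap < 0:
        return ("add_noise", level)  # difficulty-increasing
    return ("none", 0.0)


# Every response failed: treat as hard and add guidance.
hard_case = choose_augmentation([0.0, 0.0, 0.0, 0.0], reference_reward=0.5)

# Every response succeeded: treat as easy and add noise.
easy_case = choose_augmentation([1.0, 1.0, 1.0, 1.0], reference_reward=0.5)
```

Scaling the guidance or noise level with the gap mirrors the caption's statement that the level is determined adaptively from the difficulty of the current sample; the exact functional form used by the authors may differ.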