Scaling RL to Long Videos

Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han

TL;DR

LongVILA-R1 tackles long-video reasoning with a full-stack solution: a large-scale dataset (LongVideo-Reason), a two-stage training regimen (CoT-SFT followed by RL), and a scalable infrastructure (MR-SP) for efficient long-context reinforcement learning. The approach yields strong results on multiple video benchmarks, supports up to 8,192 frames per video, and speeds up long-video RL training by up to 2.1x. The publicly released training system supports RL across modalities and model families, which should ease broader adoption of RL-based long-video reasoning. The framework holds promise for robotics, embodied AI, and complex video analytics, though compute demands and privacy considerations remain practical limitations.

Abstract

We introduce a full-stack framework that uses reinforcement learning to scale up reasoning in vision-language models (VLMs) to long videos. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 104K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In our experiments, LongVILA-R1-7B achieves strong performance on video benchmarks, reaching 65.1% and 71.1% accuracy on VideoMME without and with subtitles, respectively, and consistently outperforming LongVILA-7B across multiple benchmarks. Moreover, LongVILA-R1-7B supports processing up to 8,192 frames per video and configurable FPS settings. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. In addition, we publicly release our training system, which supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames).
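
To make the two ideas behind MR-SP more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that sequence parallelism splits the frames of one video into contiguous, near-equal shards (one per rank) and that "cached video embeddings" means each video is encoded once and reused across rollouts. The names `shard_frames` and `VideoEmbeddingCache` and the toy encoder are hypothetical.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Tuple


def shard_frames(num_frames: int, sp_degree: int) -> List[Tuple[int, int]]:
    """Split frame indices [0, num_frames) into sp_degree contiguous shards.

    Returns one half-open (start, end) range per rank. Shard sizes differ
    by at most one frame (assumption: a real system may pad instead).
    """
    if num_frames < 0 or sp_degree <= 0:
        raise ValueError("num_frames must be >= 0 and sp_degree must be > 0")
    base, extra = divmod(num_frames, sp_degree)
    shards: List[Tuple[int, int]] = []
    start = 0
    for rank in range(sp_degree):
        # The first `extra` ranks take one additional frame each.
        size = base + (1 if rank < extra else 0)
        shards.append((start, start + size))
        start += size
    return shards


class VideoEmbeddingCache:
    """Encode each video once and reuse the result across rollouts (hypothetical helper)."""

    def __init__(self, encode_fn: Callable[[Sequence[float]], List[float]]) -> None:
        self._encode_fn = encode_fn
        self._cache: Dict[Hashable, List[float]] = {}
        self.encoder_calls = 0  # counts how often the encoder actually runs

    def get(self, video_id: Hashable, frames: Sequence[float]) -> List[float]:
        if video_id not in self._cache:
            self.encoder_calls += 1
            self._cache[video_id] = self._encode_fn(frames)
        return self._cache[video_id]
```

With the figures quoted above (3,600 frames, and an SP degree of 4 as in the Figure 2 setup), `shard_frames(3600, 4)` assigns 900 frames to each rank. Sampling several rollouts for the same video calls the encoder only once under this caching assumption.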

Paper Structure

This paper contains 23 sections, 2 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: Examples of LongVILA-R1. The illustration shows sample tasks and the model's reasoning. From left to right, the examples cover predicting the result of a football match, decision-making in Texas Hold'em poker, and tracking the spatial trajectory of moving objects. Notably, the spatial tracking video involves relatively complex motion, and the model does not reason about it accurately until the number of input video frames is increased to 128.
  • Figure 2: Training efficiency comparison with MR-SP (SP degree = 4) on Qwen2.5-VL-7B and LongVILA-R1-7B, on a single node with 8$\times$ A100 GPUs. MR-SP achieves up to 2.1$\times$ speed-up and avoids GPU out-of-memory (OOM) errors at high frame counts.
  • Figure 3: Data distribution of LongVideo-Reason and of the total data used in the LongVILA-R1 training framework. LongVideo-Reason comprises 18K videos and 104K QA pairs with reasoning annotations. Additionally, we include 102K QA pairs from existing works [llava-video, next-qa, perceptiontest, clevr, star].
  • Figure 4: Data generation process for the LongVideo-Reason dataset. The process begins by segmenting videos into 10-second clips and generating a caption for each clip using NVILA-8B. Then, based on the captions of all clips in a video, we use a leading open-source reasoning LLM to generate question-answer pairs that require reasoning across the content of the whole video, along with their reasoning annotations. Reasoning questions are categorized into Temporal, Goal and Purpose, Spatial, and Plot and Narrative. Finally, the reasoning annotations are reformatted for conciseness and alignment with video details. A more detailed figure of the data generation process is given in the appendix.
  • Figure 5: The LongVILA-R1 training pipeline. LongVILA-R1 builds upon the base training pipeline of LongVILA. MM-SP is then employed for SFT on long video understanding tasks with long CoT. Finally, RL is scaled up through Multi-modal Reinforcement Sequence Parallelism (MR-SP).
  • ...and 11 more figures
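
The first steps of the data generation process in Figure 4 (10-second clips, then per-clip captions gathered into one context for question generation) can be sketched roughly as below. This is an illustrative sketch only: the function names are hypothetical, the handling of a shorter final clip is an assumption, and the captions are placeholders for what a captioning model such as NVILA-8B would produce.

```python
import math
from typing import List, Sequence, Tuple


def segment_into_clips(duration_s: float, clip_len_s: float = 10.0) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds covering [0, duration_s].

    Assumption: the final clip is kept even if it is shorter than clip_len_s.
    """
    if duration_s < 0 or clip_len_s <= 0:
        raise ValueError("duration_s must be >= 0 and clip_len_s must be > 0")
    num_clips = math.ceil(duration_s / clip_len_s)
    return [
        (i * clip_len_s, min((i + 1) * clip_len_s, duration_s))
        for i in range(num_clips)
    ]


def build_caption_context(clips: Sequence[Tuple[float, float]], captions: Sequence[str]) -> str:
    """Join per-clip captions into one timestamped text block.

    The result stands in for the whole-video context that a reasoning LLM
    would receive when generating QA pairs (hypothetical format).
    """
    if len(clips) != len(captions):
        raise ValueError("need exactly one caption per clip")
    lines = [
        f"[{start:g}s-{end:g}s] {caption}"
        for (start, end), caption in zip(clips, captions)
    ]
    return "\n".join(lines)
```

For example, a 25-second video yields three clips, the last one 5 seconds long. Each clip's caption is then prefixed with its time range so that questions spanning the whole video can refer to when events occur.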