Scaling RL to Long Videos

Yukang Chen; Wei Huang; Baifeng Shi; Qinghao Hu; Hanrong Ye; Ligeng Zhu; Zhijian Liu; Pavlo Molchanov; Jan Kautz; Xiaojuan Qi; Sifei Liu; Hongxu Yin; Yao Lu; Song Han

TL;DR

LongVILA-R1 tackles long-video reasoning with a full-stack solution: the large-scale LongVideo-Reason dataset, a two-stage training pipeline (CoT-SFT followed by RL), and MR-SP, a scalable infrastructure for efficient long-context reinforcement learning. The approach yields strong results on multiple video benchmarks, supports up to 8,192 frames per video, and speeds up long-video RL training by up to 2.1x. The training system is released publicly, which opens the way for broader adoption of RL-based long-video reasoning across modalities and model families. The framework holds promise for robotics, embodied AI, and complex video analytics, though compute demands and privacy considerations remain practical limitations.

Abstract

We introduce a full-stack framework that uses reinforcement learning to scale up reasoning in vision-language models (VLMs) to long videos. We address the unique challenges of long-video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 104K long-video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long-video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In our experiments, LongVILA-R1-7B achieves strong performance on video benchmarks, reaching 65.1% and 71.1% accuracy on VideoMME without and with subtitles, respectively, and consistently outperforming LongVILA-7B across multiple benchmarks. Moreover, LongVILA-R1-7B can process up to 8,192 frames per video and supports configurable FPS settings. Notably, our MR-SP system achieves up to 2.1x speedup on long-video RL training. In addition, we publicly release our training system, which supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames).
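The two ideas the abstract attributes to MR-SP, splitting a long video across GPUs and reusing cached video embeddings across rollouts, can be illustrated with a toy sketch. This is not the MR-SP implementation: `shard_frames`, `EmbeddingCache`, and `toy_encoder` are hypothetical names introduced here, and the encoder is a placeholder rather than a real vision model. Only the 3,600-frame and 8-GPU figures come from the abstract.

```python
from typing import Callable, Dict, List


def shard_frames(num_frames: int, num_ranks: int) -> List[range]:
    """Split frame indices into contiguous, near-equal shards, one per rank."""
    base, extra = divmod(num_frames, num_ranks)
    shards: List[range] = []
    start = 0
    for rank in range(num_ranks):
        # The first `extra` ranks take one additional frame each.
        size = base + (1 if rank < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards


class EmbeddingCache:
    """Encode each video once and reuse the result for every later rollout."""

    def __init__(self, encode: Callable[[str], List[float]]) -> None:
        self._encode = encode
        self._store: Dict[str, List[float]] = {}
        self.encoder_calls = 0  # counts how often the encoder actually runs

    def get(self, video_id: str) -> List[float]:
        if video_id not in self._store:
            self._store[video_id] = self._encode(video_id)
            self.encoder_calls += 1
        return self._store[video_id]


def toy_encoder(video_id: str) -> List[float]:
    # Hypothetical stand-in for a vision encoder (assumption, not a real model).
    return [float(len(video_id))]


# An hour-long video at 3,600 frames, split across the 8 GPUs of one node.
shards = shard_frames(3600, 8)

# Four rollouts on the same video trigger a single encoder call.
cache = EmbeddingCache(toy_encoder)
rollouts = [cache.get("video_0") for _ in range(4)]
```

Under these assumptions each GPU handles 450 frames, and the encoder runs once no matter how many rollouts are sampled for the same video, which is the kind of saving that caching video embeddings is meant to provide.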
