Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning

Ziyang Wang, Jaehong Yoon, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius, Mohit Bansal

TL;DR

Video-RTS addresses the data inefficiency of RL-based video reasoning with LLMs by replacing costly supervised fine-tuning with pure outcome-based reinforcement learning. It couples GRPO-based RL, using a simple reward design that encourages explicit reasoning and correct answers, with a video-adaptive sparse-to-dense test-time scaling strategy that expands temporal context only when the sampled answers fail to reach consensus. The approach achieves competitive or superior results across five benchmarks while using only 6K training samples, and it demonstrates additive gains when combining RL with adaptive inference. This yields a practical, frame-efficient framework for scalable, interpretable video reasoning in multimodal settings.

Abstract

Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and fine-tuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach that improves video reasoning capability with drastically better data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Building on observations about data scaling, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to utilize computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy that improves inference by iteratively adding frames based on output consistency. We validate our approach on multiple video reasoning benchmarks, showing that Video-RTS surpasses existing video reasoning models by 2.4% in accuracy using only 3.6% of the training samples. Specifically, Video-RTS achieves a 4.2% improvement on Video-Holmes, a recent and challenging video reasoning benchmark. Notably, our pure-RL training and adaptive video TTS offer complementary strengths, enabling Video-RTS's strong reasoning performance.
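The output-based reward and the group-relative normalization used by GRPO can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `<think>`/`<answer>` template, the equal weighting of the two reward terms, and all function names are assumptions made for the example. Only the general idea (a format term plus an accuracy term, with advantages normalized within a group of sampled responses) comes from the text.

```python
import re
from statistics import mean, pstdev
from typing import List

# Assumed response template (hypothetical): reasoning in <think>, final choice in <answer>.
_TEMPLATE = re.compile(
    r"^\s*<think>.+?</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)


def format_reward(response: str) -> float:
    """1.0 if the response follows the assumed template, else 0.0."""
    return 1.0 if _TEMPLATE.match(response) else 0.0


def accuracy_reward(response: str, gold: str) -> float:
    """1.0 if the extracted answer matches the gold label, else 0.0."""
    match = _TEMPLATE.match(response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().upper() == gold.strip().upper() else 0.0


def outcome_reward(response: str, gold: str) -> float:
    """Sum of the two terms; equal weighting is an assumption."""
    return format_reward(response) + accuracy_reward(response, gold)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: each reward standardized within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


group = [
    "<think>The cup falls after the cat jumps.</think><answer>B</answer>",
    "<think>The cup falls first.</think><answer>A</answer>",
    "B",  # no reasoning, wrong format
]
rewards = [outcome_reward(r, gold="B") for r in group]
advantages = group_relative_advantages(rewards)
```

Because the advantages are computed relative to the other responses for the same question, no learned value model or step-level annotation is needed; only the final answer and the output format are scored.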

Paper Structure

This paper contains 33 sections, 4 equations, 4 figures, 8 tables, 1 algorithm.

Figures (4)

  • Figure 1: Training and inference recipe comparison between Video-R1 (Feng et al., 2025) and our Video-RTS. While (a) Video-R1 uses a two-stage pipeline with SFT and RL, (b) Video-RTS adopts a pure-RL approach with output-based rewards for better data efficiency. We further enhance the reasoning of Video-RTS by proposing dynamic sparse-to-dense video test-time scaling. The format reward is omitted, as both models use it.
  • Figure 2: Overview of Video-RTS. The training phase (top) adopts GRPO-based RL to optimize the MLLM with outcome and accuracy rewards. At inference (bottom), Video-RTS conducts dynamic sparse-to-dense reasoning by traversing sampled frames to generate rationales. If the answers are in consensus, it returns that answer; otherwise, it samples denser frames.
  • Figure 3: Analysis on the number of training samples for pure-RL training.
  • Figure 4: Illustration of dynamic sparse-to-dense reasoning in Video-RTS. Video-RTS identifies when the sampled visual information is insufficient for accurately reasoning about the input query (reasoning highlighted in yellow background), which often leads to no consensus among intermediate reasoning steps and potentially inaccurate predictions (in red). Through the proposed dynamic sparse-to-dense reasoning mechanism, Video-RTS adaptively refines its reasoning process (in green), achieving accurate and consensus-driven predictions.
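
The sparse-to-dense inference loop described in the Figure 2 and Figure 4 captions can be sketched as below. This is a hedged illustration under stated assumptions, not the paper's algorithm as published: the `generate` callback stands in for the MLLM, and the unanimous-agreement criterion, the frame-doubling schedule, the default budgets (32 to 128 frames, 5 samples), and the majority-vote fallback are all choices made for the example.

```python
from collections import Counter
from typing import Callable, List, Tuple


def uniform_indices(total_frames: int, num_frames: int) -> List[int]:
    """Evenly spaced frame indices, taken from the center of each segment."""
    n = min(num_frames, total_frames)
    step = total_frames / n
    return [int(step * i + step / 2) for i in range(n)]


def sparse_to_dense_answer(
    generate: Callable[[List[int], str], str],  # hypothetical model call
    total_frames: int,
    question: str,
    start_frames: int = 32,   # assumed starting budget
    max_frames: int = 128,    # assumed frame cap
    num_samples: int = 5,     # assumed number of sampled rationales
) -> Tuple[str, int]:
    """Return (answer, frames_used), densifying frames until answers agree."""
    n = start_frames
    while True:
        frames = uniform_indices(total_frames, n)
        answers = [generate(frames, question) for _ in range(num_samples)]
        top_answer, count = Counter(answers).most_common(1)[0]
        consensus = count == num_samples  # assumed criterion: all samples agree
        budget_exhausted = n >= max_frames or n >= total_frames
        if consensus or budget_exhausted:
            # Without consensus at the cap, fall back to the majority answer.
            return top_answer, n
        n = min(n * 2, max_frames)  # assumed schedule: double the frame count


# Toy stand-in for the model: answers disagree until at least 64 frames are seen.
calls: List[int] = []


def fake_generate(frames: List[int], question: str) -> str:
    calls.append(len(frames))
    if len(frames) < 64:
        return "A" if len(calls) % 2 else "B"
    return "A"


answer, frames_used = sparse_to_dense_answer(fake_generate, 1000, "What happens next?")
```

Easy queries stop at the sparse setting, so the denser (and costlier) passes are spent only on videos where the sampled rationales disagree.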