ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning

Ziqiang Xu, Qi Dai, Tian Xie, Yifan Yang, Kai Qiu, DongDong Chen, Zuxuan Wu, Chong Luo

TL;DR

ViaRL tackles temporal grounding in video understanding by learning a frame selector through rule-based reinforcement learning, using the downstream multimodal language model's answer accuracy as the reward. The authors introduce the Visual Iterated Amplification Learning System, which alternates training the frame selector and the answer model in cycles to progressively improve performance and robustness. Across the VideoMME, LVBench, and MLVU benchmarks, ViaRL yields consistent gains, notably an improvement of nearly 15% on Needle QA, and generalizes well while training scalably without costly frame annotations. The framework advances intention-driven video understanding by combining structured rewards, interpretability-promoting prompts, and iterative optimization to mimic human-like perceptual learning.

Abstract

Video understanding is inherently intention-driven: humans naturally focus on relevant frames based on their goals. Recent advancements in multimodal large language models (MLLMs) have enabled flexible query-driven reasoning; however, video-based frameworks like Video Chain-of-Thought lack direct training signals to effectively identify relevant frames. Current approaches often rely on heuristic methods or pseudo-label supervised annotations, which are both costly and limited in scalability across diverse scenarios. To overcome these challenges, we introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in intention-driven video understanding. An iterated amplification strategy is adopted to perform alternating cyclic training in the video CoT system, where each component undergoes iterative cycles of refinement to improve its capabilities. ViaRL uses the answer accuracy of a downstream model as a reward signal to train a frame selector through trial and error, eliminating the need for expensive annotations while closely aligning with human-like learning processes. Comprehensive experiments across multiple benchmarks, including VideoMME, LVBench, and MLVU, demonstrate that ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks, highlighting its effectiveness and scalability. Notably, ViaRL achieves a nearly 15% improvement on Needle QA, a subset of MLVU that requires locating a specific needle within a long video and is regarded as one of the most suitable benchmarks for evaluating temporal grounding.
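
The reward idea in the abstract, where a frame selection is scored by whether the downstream model then answers correctly, can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the paper: `accuracy_reward`, `score_selection`, and `toy_answer_model` are illustrative stand-ins, the reward is assumed to be a plain 0/1 exact match on a multiple-choice letter, and the actual reward used by ViaRL may include further terms.

```python
from typing import Callable, Sequence


def accuracy_reward(predicted: str, ground_truth: str) -> float:
    """Assumed rule-based reward: 1.0 on an exact (case-insensitive) match, else 0.0."""
    return 1.0 if predicted.strip().upper() == ground_truth.strip().upper() else 0.0


def score_selection(
    frame_indices: Sequence[int],
    question: str,
    ground_truth: str,
    answer_model: Callable[[Sequence[int], str], str],
) -> float:
    """Score one candidate frame selection by the downstream model's answer."""
    predicted = answer_model(frame_indices, question)
    return accuracy_reward(predicted, ground_truth)


def toy_answer_model(frame_indices: Sequence[int], question: str) -> str:
    """Hypothetical stand-in for an MLLM: answers "B" only if frame 42 was selected."""
    return "B" if 42 in frame_indices else "A"


# Two candidate selections for the same question; only the first contains the "needle".
candidates = ([3, 42, 90], [1, 2, 3])
rewards = [
    score_selection(sel, "Which option is correct?", "B", toy_answer_model)
    for sel in candidates
]
```

In this toy setup the selection that includes frame 42 earns a reward of 1.0 and the other earns 0.0, so no frame-level annotation is needed: the only supervision is the ground-truth answer to the question.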

Paper Structure

This paper contains 25 sections, 7 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: The overall architecture of our approach.
  • Figure 2: Schematic of our Visual Iterated Amplification System implementation in each cycle.
  • Figure 3: ViaRL improves the baseline MLLMs for video understanding. The $N$ selected frames are shown. The most relevant frame is indicated by a green box in each row.
  • Figure 4: Performance of ViaRL over multiple cycles and stages, attributable to the intertwined improvement of the models' capabilities during the iterative process. The horizontal axis $(i,j)$ represents the $j$-th stage of the $i$-th cycle. For example, $(2,1)$ indicates the evaluation model $M_1$ has learned twice, and $M_2$ has learned once. The initial state is denoted as $(0,0)$.
  • Figure 5: Visualization across diverse scenarios on VideoMME.
  • ...and 2 more figures
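
The $(i,j)$ axis convention in the Figure 4 caption can be made concrete with a short Python sketch. It assumes, based only on the caption's example, that each cycle has exactly two stages, with $M_1$ trained in stage 1 and $M_2$ in stage 2; the function names `training_counts` and `axis_points` are hypothetical and not from the paper.

```python
from typing import List, Tuple


def training_counts(i: int, j: int) -> Tuple[int, int]:
    """Return (times M_1 has trained, times M_2 has trained) at axis point (i, j).

    Assumes two stages per cycle: stage 1 trains M_1, stage 2 trains M_2.
    """
    if (i, j) == (0, 0):
        return (0, 0)  # initial state, nothing trained yet
    if i < 1 or j not in (1, 2):
        raise ValueError(f"invalid axis point ({i}, {j})")
    # After stage 1 of cycle i, M_2 is still one round behind M_1.
    return (i, i - 1) if j == 1 else (i, i)


def axis_points(num_cycles: int) -> List[Tuple[int, int]]:
    """List the axis points in training order, starting from the initial state."""
    points = [(0, 0)]
    for i in range(1, num_cycles + 1):
        points += [(i, 1), (i, 2)]
    return points


schedule = {point: training_counts(*point) for point in axis_points(2)}
```

Under this assumption, `training_counts(2, 1)` gives `(2, 1)`, which matches the caption's example that $M_1$ has learned twice and $M_2$ once at $(2,1)$.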