Are Video Reasoning Models Ready to Go Outside?

Yangfan He, Changgyu Boo, Jaehong Yoon

TL;DR

We propose ROVA, a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA also introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability.

Abstract

In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially, revealing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness. To address this limitation, we propose ROVA, a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. Specifically, it continuously re-estimates sample difficulty via self-reflective evaluation, enabling adaptive training with a robustness-aware consistency reward. We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied video datasets to assess both accuracy and reasoning quality under realistic disturbances. We evaluate ROVA and baselines on PVRBench, UrbanVideo, and VisBench, where open-source and proprietary models suffer up to 35% and 28% drops in accuracy and reasoning under realistic perturbations. ROVA effectively mitigates performance degradation, boosting relative accuracy by at least 24% and reasoning by over 9% compared with baseline models (Qwen2.5/3-VL, InternVL2.5, Embodied-R). These gains transfer to clean standard benchmarks, yielding consistent improvements.
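
The abstract describes the consistency reward and the difficulty-aware sample selection only at a high level, so the following is a minimal sketch of how such a loop could be wired together, not the paper's implementation. Every name here (`Sample`, `consistency_reward`, `update_difficulty`, `select_informative`) is hypothetical, and the equal reward weights, the moving-average difficulty update (standing in for the self-reflective evaluation), and the `[0.1, 0.9]` difficulty band are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical training record; field names are illustrative only."""
    sample_id: str
    difficulty: float = 0.5  # running estimate in [0, 1]


def _norm(answer: str) -> str:
    """Normalize an answer string for exact-match comparison."""
    return answer.strip().lower()


def consistency_reward(clean: str, perturbed: str, gold: str,
                       w_correct: float = 0.5, w_agree: float = 0.5) -> float:
    """Toy reward (assumed form): credit for answering the perturbed input
    correctly, plus credit for agreeing with the clean-input answer."""
    correct = float(_norm(perturbed) == _norm(gold))
    agree = float(_norm(perturbed) == _norm(clean))
    return w_correct * correct + w_agree * agree


def update_difficulty(difficulty: float, reward: float,
                      momentum: float = 0.5) -> float:
    """Assumed stand-in for difficulty re-estimation: an exponential moving
    average of (1 - reward), so repeated low rewards raise the estimate."""
    return (1.0 - momentum) * difficulty + momentum * (1.0 - reward)


def select_informative(batch: list[Sample], low: float = 0.1,
                       high: float = 0.9) -> tuple[list[Sample], float]:
    """Keep samples that are neither trivially easy nor persistently failed,
    and report the fraction of the batch that is kept."""
    kept = [s for s in batch if low <= s.difficulty <= high]
    ratio = len(kept) / len(batch) if batch else 0.0
    return kept, ratio


if __name__ == "__main__":
    batch = [Sample("easy", 0.05), Sample("useful", 0.5), Sample("hard", 0.95)]
    kept, ratio = select_informative(batch)
    print([s.sample_id for s in kept], ratio)
```

The fraction returned by `select_informative` plays the role of a per-step training ratio: when the band discards some samples, fewer rollouts are trained on in that step.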

Paper Structure (60 sections, 1 theorem, 31 equations, 19 figures, 15 tables, 2 algorithms)

This paper contains 60 sections, 1 theorem, 31 equations, 19 figures, 15 tables, 2 algorithms.

Key Result

Proposition 1

Let $\rho_t$ denote the effective training ratio at step $t$, and let $\bar{\rho} = \frac{1}{T_{\text{RL}}} \sum_{t=1}^{T_{\text{RL}}} \rho_t$ be the average training ratio over $T_{\text{RL}}$ RL steps. Ignoring the amortized memory re-evaluation cost (which occurs every 50 steps), the per-step cost ... When $\bar{\rho} < 1$ (i.e., the curriculum discards some fraction of samples), and $C_{\text{judge}}$ ...

Figures (19)

  • Figure 1: Failure cases of Qwen2.5-VL under two representative perturbations: (a) occlusion (left) and (b) adverse weather (right). The model incorrectly predicts "Turn Left" under occlusion and "Turn Right" under fog, despite the ground truth being "Go Ahead" in both cases, demonstrating how realistic perturbations mislead reasoning and motivating the need for robustness-aware training.
  • Figure 2: Overview of ROVA: (1) structured spatio-temporal corruption that generates realistic perturbations, (2) self-reflective evaluation with difficulty-aware online training that adaptively prioritizes informative samples, and (3) dual-branch alignment reward modeling that enforces output consistency between clean and perturbed inputs.
  • Figure 3: Overview of the perturbation types in PVRBench.
  • Figure 4: Analysis of self-reflective evaluation and difficulty-aware training for ROVA during the first epoch of Qwen2.5-VL-7B training.
  • Figure 5: Ablation studies of ROVA. (a) Impact of individual components on answer accuracy. (b) Comparison of corruption mask strategies across perturbation types. Experiments are conducted using the Qwen3-VL-13B model trained for 3 epochs.
  • ...and 14 more figures

Theorems & Definitions (2)

  • Proposition 1: Amortized cost advantage of ROVA
  • proof