Are Video Reasoning Models Ready to Go Outside?

Yangfan He; Changgyu Boo; Jaehong Yoon

TL;DR

ROVA is a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. It also introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability.

Abstract

In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially, revealing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness. To address this limitation, we propose ROVA, a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. Specifically, it continuously re-estimates sample difficulty via self-reflective evaluation, enabling adaptive training with a robustness-aware consistency reward. We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied video datasets to assess both accuracy and reasoning quality under realistic disturbances. We evaluate ROVA and baselines on PVRBench, UrbanVideo, and VisBench, where open-source and proprietary models suffer up to 35% and 28% drops in accuracy and reasoning under realistic perturbations. ROVA effectively mitigates performance degradation, boosting relative accuracy by at least 24% and reasoning by over 9% compared with baseline models (Qwen2.5/3-VL, InternVL2.5, Embodied-R). These gains transfer to clean standard benchmarks, yielding consistent improvements.
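As a rough illustration of the two ideas named in the abstract, the sketch below scores agreement between answers on clean and perturbed inputs and keeps only samples of intermediate difficulty. The reward weights, the pass-rate band, and the names `consistency_reward` and `select_informative` are assumptions made for illustration only; this summary does not specify ROVA's actual reward or curriculum rule.

```python
def consistency_reward(clean_answer: str, perturbed_answer: str, ground_truth: str) -> float:
    """Hypothetical reward: half credit for a correct answer on the perturbed
    input, half credit for agreeing with the answer on the clean input."""
    correct = 1.0 if perturbed_answer == ground_truth else 0.0
    agree = 1.0 if perturbed_answer == clean_answer else 0.0
    return 0.5 * correct + 0.5 * agree


def select_informative(pass_rates: dict[str, float], low: float = 0.2, high: float = 0.8) -> list[str]:
    """Keep samples whose estimated pass rate lies in [low, high].
    The band is an assumed stand-in for 'informative' difficulty."""
    return [sid for sid, rate in pass_rates.items() if low <= rate <= high]


# Toy inputs (made up): one consistent, correct case and one misled by a perturbation.
rewards = [
    consistency_reward("Go Ahead", "Go Ahead", "Go Ahead"),
    consistency_reward("Go Ahead", "Turn Left", "Go Ahead"),
]
selected = select_informative({"easy": 0.95, "medium": 0.5, "hard": 0.05})
print(rewards, selected)
```

Under these assumptions, samples the model already solves reliably or fails almost always are skipped, so each training step focuses on cases where a perturbation plausibly changes the outcome.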

Paper Structure

This paper contains 60 sections, 1 theorem, 31 equations, 19 figures, 15 tables, and 2 algorithms.

Introduction
Related Work
Training Robust Video Reasoning Models with ROVA
Learning with Structured Spatio-Temporal Corruption
Self-Reflective Difficulty-Aware Training
Dual-Branch Alignment Optimization
Evaluating Video Reasoning under Various Realistic Disturbances
Experiment
Implementation Details.
Main Results
Ablation Study and Analysis
Qualitative Analysis
Conclusion
Limitation
Full Details of Dataset Construction
...and 45 more sections

Key Result

Proposition 1

Let $\rho_t$ denote the effective training ratio at step $t$, and let $\bar{\rho} = \frac{1}{T_{\text{RL}}} \sum_{t=1}^{T_{\text{RL}}} \rho_t$ be the average training ratio over $T_{\text{RL}}$ RL steps. Ignoring the amortized memory re-evaluation cost (which occurs every 50 steps), the per-step cos When $\bar{\rho} < 1$ (i.e., the curriculum discards some fraction of samples), and $C_{\text{judge
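The average training ratio $\bar{\rho}$ defined in the proposition is a plain mean over RL steps, which the short sketch below computes. The per-step ratios are made-up values and `average_training_ratio` is a hypothetical helper name; the cost expression itself is truncated in this summary, so it is not reproduced here.

```python
def average_training_ratio(ratios: list[float]) -> float:
    """Mean of the per-step effective training ratios rho_t over T_RL steps."""
    if not ratios:
        raise ValueError("need at least one RL step")
    return sum(ratios) / len(ratios)


# Illustrative per-step ratios: the fraction of samples kept at each RL step.
rho = [1.0, 0.75, 0.5, 0.75]
rho_bar = average_training_ratio(rho)
print(rho_bar)  # 0.75, i.e. rho_bar < 1 because some samples were discarded
```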

Figures (19)

Figure 1: Failure cases of Qwen2.5-VL under two representative perturbations: (a) occlusion (left) and (b) adverse weather (right). The model incorrectly predicts "Turn Left" under occlusion and "Turn Right" under fog, although the ground truth is "Go Ahead" in both cases. This demonstrates how realistic perturbations mislead reasoning and motivates the need for robustness-aware training.
Figure 2: Overview of ROVA: (1) structured spatio-temporal corruption that generates realistic perturbations, (2) self-reflective evaluation with difficulty-aware online training that adaptively prioritizes informative samples, and (3) dual-branch alignment reward modeling that enforces output consistency between clean and perturbed inputs.
Figure 3: Overview of the perturbation types in PVRBench.
Figure 4: Analysis of self-reflective evaluation and difficulty-aware training for ROVA during the first epoch of Qwen2.5-VL-7B training.
Figure 5: Ablation studies of ROVA. (a) Impact of individual components on answer accuracy. (b) Comparison of corruption mask strategies across perturbation types. Experiments are conducted using the Qwen3-VL-13B model trained for 3 epochs.
...and 14 more figures

Theorems & Definitions (2)

Proposition 1: Amortized cost advantage of ROVA
proof
