Stress Tests REVEAL Fragile Temporal and Visual Grounding in Video-Language Models
Sethuraman T, Savya Khosla, Aditi Tiwari, Vidya Ganesh, Rakshana Jayaprakash, Aditya Jain, Vignesh Srinivasakumar, Onkar Kishor Susladkar, Srinidhi Sunkara, Aditya Shanmugham, Rakesh Vaideeswaran, Abbaas Alif Mohamed Nishar, Simon Jenni, Derek Hoiem
TL;DR
This work presents REVEAL, a diagnostic benchmark that stress-tests Video-Language Models (VidLMs) on grounded video understanding, temporal reasoning, and motion perception. It introduces five controlled stress tests (video sycophancy, language-only shortcuts, temporal expectation bias, spatiotemporal occlusion, and camera-motion sensitivity), coupled with an automated data-generation pipeline that perturbs existing datasets for reproducible evaluation. Across open- and closed-source VidLMs, REVEAL uncovers systematic weaknesses: models often rely on linguistic priors rather than visual evidence, misinterpret temporal sequences, and struggle with camera motion and cross-frame integration, while humans remain robust. These findings expose a gap between high in-distribution accuracy and genuine video-grounded reasoning, and they supply a scalable tool to diagnose, understand, and guide improvements in temporal grounding, visual understanding, and multimodal fusion in VidLMs.
Abstract
This work investigates a fundamental question: Do Video-Language Models (VidLMs) robustly account for video content, temporal sequence, and motion? Our investigation shows that, surprisingly, they often do not. We introduce REVEAL, a diagnostic benchmark that probes fundamental weaknesses of contemporary VidLMs through five controlled stress tests: temporal expectation bias, reliance on language-only shortcuts, video sycophancy, camera motion sensitivity, and robustness to spatiotemporal occlusion. We test leading open- and closed-source VidLMs and find that these models confidently describe reversed scenes as forward, answer questions while neglecting video content, agree with false claims, struggle with basic camera motion, and fail to aggregate temporal information under simple spatiotemporal masking. Humans, on the other hand, succeed at these tasks with ease. Alongside our benchmark, we provide a data pipeline that automatically generates diagnostic examples for our stress tests, enabling broader and more scalable evaluation. We will release our benchmark and code to support future research.
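To make two of the perturbations named above concrete, the following is a minimal sketch of temporal reversal and per-frame spatiotemporal masking on a toy clip. It is not the REVEAL pipeline: the list-of-frames video representation, the helper names `reverse_video` and `occlude_video`, and the parameters `patch`, `keep_ratio`, `seed`, and `fill` are all assumptions introduced here for illustration.

```python
import random

# Assumption: a video is a list of frames, and each frame is a 2D list of
# pixel values. Both helpers below are hypothetical illustrations.


def reverse_video(frames):
    """Return the frames in reverse temporal order (input is not modified)."""
    return frames[::-1]


def occlude_video(frames, patch, keep_ratio, seed=0, fill=0):
    """Mask each frame on a patch grid, keeping about `keep_ratio` of patches.

    A fresh random mask is drawn for every frame, so different regions are
    visible at different times and the scene must be pieced together
    across frames.
    """
    rng = random.Random(seed)
    occluded = []
    for frame in frames:
        height, width = len(frame), len(frame[0])
        out = [row[:] for row in frame]  # copy so the original stays intact
        for top in range(0, height, patch):
            for left in range(0, width, patch):
                if rng.random() >= keep_ratio:  # hide this patch
                    for y in range(top, min(top + patch, height)):
                        for x in range(left, min(left + patch, width)):
                            out[y][x] = fill
        occluded.append(out)
    return occluded


# Toy clip: 4 frames of 4x4 pixels; every pixel stores its frame index + 1.
clip = [[[t + 1] * 4 for _ in range(4)] for t in range(4)]
reversed_clip = reverse_video(clip)
masked_clip = occlude_video(clip, patch=2, keep_ratio=0.5)
```

In a setup like this, a model would be asked the same question about `clip` and `reversed_clip`, or about `clip` and `masked_clip`; identical answers on the forward and reversed clips would suggest that the answer does not depend on temporal order.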
