
Stress Tests REVEAL Fragile Temporal and Visual Grounding in Video-Language Models

Sethuraman T, Savya Khosla, Aditi Tiwari, Vidya Ganesh, Rakshana Jayaprakash, Aditya Jain, Vignesh Srinivasakumar, Onkar Kishor Susladkar, Srinidhi Sunkara, Aditya Shanmugham, Rakesh Vaideeswaran, Abbaas Alif Mohamed Nishar, Simon Jenni, Derek Hoiem

TL;DR

This work presents REVEAL, a diagnostic benchmark that stress-tests Video-Language Models (VidLMs) on grounded video understanding, temporal reasoning, and motion perception. It introduces five controlled stress tests (video sycophancy, language-only shortcuts, temporal expectation bias, spatiotemporal occlusion, and camera-motion sensitivity), coupled with an automated data-generation pipeline that perturbs existing datasets for reproducible evaluation. Across open- and closed-source VidLMs, REVEAL exposes systematic weaknesses: models often rely on linguistic priors rather than visual evidence, misinterpret temporal sequences, and struggle with camera motion and cross-frame integration, while humans remain robust. These findings expose a gap between high in-distribution accuracy and genuine video-grounded reasoning, and the benchmark supplies a scalable tool to diagnose, understand, and guide improvements in temporal grounding, visual understanding, and multimodal fusion in VidLMs.

Abstract

This work investigates a fundamental question: Do Video-Language Models (VidLMs) robustly account for video content, temporal sequence, and motion? Our investigation shows that, surprisingly, they often do not. We introduce REVEAL, a diagnostic benchmark that probes fundamental weaknesses of contemporary VidLMs through five controlled stress tests, assessing temporal expectation bias, reliance on language-only shortcuts, video sycophancy, camera motion sensitivity, and robustness to spatiotemporal occlusion. We test leading open- and closed-source VidLMs and find that these models confidently describe reversed scenes as forward, answer questions while neglecting video content, agree with false claims, struggle with basic camera motion, and fail to aggregate temporal information under simple spatiotemporal masking. Humans, on the other hand, succeed at these tasks with ease. Alongside our benchmark, we provide a data pipeline that automatically generates diagnostic examples for our stress tests, enabling broader and more scalable evaluation. We will release our benchmark and code to support future research.

Paper Structure (44 sections, 14 figures, 12 tables, 3 algorithms)

Figures (14)

  • Figure 1: REVEAL exposes failure modes in state-of-the-art VidLMs. Top: Models rely on language cues rather than visual evidence, failing to recognize that the video is blank when an answer can be inferred from text alone. Middle: Models misinterpret temporally reversed events, indicating weak temporal grounding. Bottom: Models confuse camera motion, failing to detect a lane change. Humans perform these tasks reliably.
  • Figure 2: Video Sycophancy. Illustration of four sycophancy task types: (i) Temporal Reordering, (ii) Fine-grained Action Substitution, (iii) State Reversal, and (iv) Object-Attribute Modification. Across all tasks, VidLMs often agree with false user claims, highlighting their difficulty in rejecting incorrect descriptions even when the video contradicts them.
  • Figure 3: Reliance on Language-Only Shortcuts. Despite clear visual evidence, VidLMs default to the textually plausible option (A $\to$ B $\to$ C $\to$ D) for ordering and action queries, revealing a strong reliance on linguistic cues.
  • Figure 4: Temporal Expectation Bias. When the video is reversed, the candle visibly rebuilds instead of melting, yet the VidLM still describes a normal forward melting process. This illustrates how strong event priors can override the actual temporal direction observed in the video.
  • Figure 5: Robustness to Spatiotemporal Occlusion. Each frame is duplicated four times with disjoint visible regions spread across the duplicates. Humans integrate these fragments to recover the action, but VidLMs often fail, indicating weak spatiotemporal fusion.
  • ...and 9 more figures
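
The occlusion construction described in the Figure 5 caption (each frame duplicated four times, with disjoint visible regions spread across the duplicates) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `occlude_frame` and `occlude_video`, the square tile layout, the random tile assignment, and the fill value of 0 are all assumptions made for illustration. Frames are plain nested lists here to keep the example dependency-free; a real pipeline would more likely operate on image arrays.

```python
import random


def occlude_frame(frame, parts=4, cell=2, fill=0, rng=None):
    """Split one frame (an H x W nested list) into `parts` masked copies.

    Assumption: the frame is tiled into cell x cell squares and each tile is
    made visible in exactly one randomly chosen copy. The visible regions are
    therefore disjoint and together cover the whole frame.
    """
    if rng is None:
        rng = random.Random(0)
    h, w = len(frame), len(frame[0])
    # Start every copy fully masked with the fill value.
    copies = [[[fill] * w for _ in range(h)] for _ in range(parts)]
    for y0 in range(0, h, cell):
        for x0 in range(0, w, cell):
            k = rng.randrange(parts)  # the copy that reveals this tile
            for y in range(y0, min(y0 + cell, h)):
                for x in range(x0, min(x0 + cell, w)):
                    copies[k][y][x] = frame[y][x]
    return copies


def occlude_video(frames, parts=4, cell=2, fill=0, seed=0):
    """Replace each frame by its `parts` masked copies, keeping frame order."""
    rng = random.Random(seed)
    out = []
    for frame in frames:
        out.extend(occlude_frame(frame, parts, cell, fill, rng))
    return out
```

Under these assumptions, a video of T frames becomes a video of 4T frames in which no single frame shows the full scene, so recovering the action requires integrating information across consecutive frames.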