VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification

Jiahao Meng, Tan Yue, Qi Xu, Haochen Wang, Zhongwei Ren, Weisong Liu, Yuhao Wang, Renrui Zhang, Yunhai Tong, Haodong Duan

Abstract

Recent video multimodal large language models achieve impressive results across various benchmarks. However, current evaluations suffer from two critical limitations: (1) inflated scores can mask deficiencies in fine-grained visual understanding and reasoning, and (2) answer correctness is often measured without verifying whether models identify the precise spatio-temporal evidence supporting their predictions. To address this, we present VideoZeroBench, a hierarchical benchmark for challenging long-video question answering that rigorously verifies spatio-temporal evidence. It comprises 500 manually annotated questions across 13 domains, each paired with temporal intervals and spatial bounding boxes as evidence. To disentangle answer generation, temporal grounding, and spatial grounding, we introduce a five-level evaluation protocol that progressively tightens evidence requirements. Experiments show that even Gemini-3-Pro correctly answers fewer than 17% of questions under the standard end-to-end QA setting (Level-3). When grounding constraints are imposed, performance drops sharply: no model exceeds 1% accuracy when both a correct answer and accurate spatio-temporal localization are required (Level-5), and most fail to produce any correctly grounded prediction. These results expose a significant gap between surface-level answer correctness and genuine evidence-based reasoning, revealing that grounded video understanding remains a bottleneck for long-video QA. We further analyze performance across minimal evidence spans, atomic abilities, and inference paradigms, providing insights for future research in grounded video reasoning. The benchmark and code will be made publicly available.
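The five-level protocol progressively tightens evidence requirements, culminating in a Level-5 check where an answer only counts if it is backed by accurate temporal and spatial localization. As a rough illustration of how such a grounded check could be scored, here is a minimal Python sketch; the interval and box formats, the IoU thresholds (`t_thr`, `s_thr`), and all function names are hypothetical assumptions for illustration, not the benchmark's actual implementation.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) intervals, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def spatial_iou(pred, gt):
    """IoU between two (x1, y1, x2, y2) bounding boxes."""
    def area(b):
        return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    ix = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    iy = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = ix * iy
    union = area(pred) + area(gt) - inter
    return inter / union if union > 0 else 0.0

def grounded_correct(answer_ok, pred_interval, gt_interval,
                     pred_box, gt_box, t_thr=0.5, s_thr=0.5):
    """Level-5-style check (assumed semantics): the answer must be
    correct AND the cited temporal interval and bounding box must
    overlap the annotated evidence above the assumed IoU thresholds."""
    return (answer_ok
            and temporal_iou(pred_interval, gt_interval) >= t_thr
            and spatial_iou(pred_box, gt_box) >= s_thr)

# Example: a correct answer whose cited evidence lies in the wrong
# segment fails the grounded check, even though plain QA accuracy
# would still count it.
print(grounded_correct(True, (12.0, 18.0), (40.0, 46.0),
                       (0.1, 0.1, 0.4, 0.4), (0.5, 0.5, 0.9, 0.9)))  # False
```

Under such a scheme, Level-3 would correspond to checking `answer_ok` alone, which is exactly why a model can score well end-to-end while almost never passing the joint grounding requirement.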

Figures (11)

  • Figure 1: We introduce VideoZeroBench, a challenging long-video understanding benchmark with hierarchical spatio-temporal evidence verification. Frontier models achieve only 17% accuracy in standard video QA and no more than 1% when correct spatio-temporal grounding is required.
  • Figure 2: Data construction and statistics of VideoZeroBench. All questions and evidence are manually annotated and verified. The benchmark spans 13 video domains and covers 11 atomic capabilities grouped into Detailed Perception (A), Spatial & Temporal Reasoning (B), and Semantic & Cross-Modal Reasoning (C). The bottom plots show the distributions of video length and minimal evidence span across categories.
  • Figure 3: Performance comparison across atomic abilities, video categories, and minimal temporal evidence spans.
  • Figure 4: Examples from VideoZeroBench with annotated spatio-temporal evidence and model predictions. (1) Models localize redundant or incorrect evidence and misinterpret relative spatial relationships. (2) Correct answers may still lack precise spatial grounding for all relevant objects. (3) Models fail to integrate evidence across multiple temporal segments and to correctly determine turning directions. (4) Models struggle to combine auditory cues with visual evidence for reasoning.
  • Figure 5: Annotation interface used for question construction and evidence labeling.
  • ...and 6 more figures