Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?

Bo Feng, Zhengfeng Lai, Shiyu Li, Zizhen Wang, Simon Wang, Ping Huang, Meng Cao

TL;DR

Video benchmarks often conflate language priors with true temporal understanding, which distorts assessments of video LLMs. The paper presents VBenchComp, an automated pipeline that classifies benchmark questions into LLM-Answerable, Semantic, Temporal, and Others in order to isolate temporal reasoning and diagnose benchmark composition. Experiments across seven benchmarks reveal these biases and show where traditional single scores overestimate temporal understanding, which enables targeted improvements in benchmark design. The proposed VBenchComp Score, computed on the semantically and temporally informative questions, yields similar model rankings with fewer items, offering a more efficient and interpretable metric for video LLM research.

Abstract

Existing video understanding benchmarks often conflate knowledge-based and purely image-based questions, rather than clearly isolating a model's temporal reasoning ability, which is the key aspect that distinguishes video understanding from other modalities. We identify two major limitations that obscure whether higher scores truly indicate stronger understanding of the dynamic content in videos: (1) strong language priors, where models can answer questions without watching the video; and (2) shuffling invariance, where models maintain similar performance on certain questions even when video frames are temporally shuffled. To alleviate these issues, we propose VBenchComp, an automated pipeline that categorizes questions into different domains: LLM-Answerable, Semantic, and Temporal. Specifically, LLM-Answerable questions can be answered without viewing the video; Semantic questions remain answerable even when the video frames are shuffled; and Temporal questions require understanding the correct temporal order of frames. The rest of the questions are labeled as Others. This can enable fine-grained evaluation of different capabilities of a video LLM. Our analysis reveals nuanced model weaknesses that are hidden by traditional overall scores, and we offer insights and recommendations for designing future benchmarks that more accurately assess video LLMs.

Paper Structure

This paper contains 18 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Examples of LLM-Answerable, Semantic, and Temporal questions in VideoMME [fu2024video]: (Top) The model answers correctly from the LLM's prior knowledge, without needing the video; (Middle) The model answers through semantic understanding, without requiring temporal comprehension; (Bottom) The model needs comprehensive temporal understanding to answer.
  • Figure 2: Performance of different MLLMs on four benchmarks when no video is given as input.
  • Figure 3: After the extracted frames are shuffled, each model's scores remain largely unchanged across all benchmarks. *Frame settings: (a) and (d) use 128 frames for VideoMME-long and 64 frames otherwise; (b) uses $10_{\text{slow}} + 50_{\text{fast}}$ frames for all benchmarks; (c) uses 16 frames for all benchmarks.
  • Figure 4: An overview of our standardized protocol: benchmark questions are categorized into four groups. Questions answerable by both GPT-4o and Gemini without video are classified as LLM-Answerable. For the remaining questions, we apply random shuffles to the extracted frames twice: if both models answer correctly before and after shuffling, the question is classified as Semantic. If one model answers correctly before but fails after shuffling, the question is classified as Temporal. All other questions are categorized as Others.
  • Figure 5: VBenchComp scores align with the original scores but evaluate overall video LLM performance better with fewer questions. The temporal understanding capability of models below the trend line may be overestimated by the original benchmarks.
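
The categorization protocol described in the Figure 4 caption can be sketched as a small decision function. This is only an illustrative reading of the caption, not the authors' implementation: the `ModelResult` record and the `categorize` function are hypothetical names, and two points the caption leaves open are filled in as assumptions. "One model" is read as "at least one model", and "fails after shuffling" is read as failing on at least one of the two shuffles.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ModelResult:
    """Correctness of one judge model on one question (hypothetical record)."""
    no_video: bool               # correct when given the question only
    ordered: bool                # correct with frames in their original order
    shuffled: Tuple[bool, bool]  # correct on each of the two random shuffles


def categorize(results: Dict[str, ModelResult]) -> str:
    """Assign a question to one of the four groups from the Figure 4 caption."""
    models = list(results.values())

    # Answerable by every judge model without the video.
    if all(m.no_video for m in models):
        return "LLM-Answerable"

    # Every model is correct before and after both shuffles.
    if all(m.ordered and all(m.shuffled) for m in models):
        return "Semantic"

    # Assumption: at least one model is correct on ordered frames
    # but wrong on at least one shuffle.
    if any(m.ordered and not all(m.shuffled) for m in models):
        return "Temporal"

    return "Others"


if __name__ == "__main__":
    example = {
        "gpt-4o": ModelResult(no_video=False, ordered=True, shuffled=(False, True)),
        "gemini": ModelResult(no_video=False, ordered=True, shuffled=(True, True)),
    }
    print(categorize(example))
```

The checks run in the order the caption gives them, so a question that is answerable without video is never tested for shuffle sensitivity. Under a stricter reading of the Temporal rule (for example, requiring failure on both shuffles), only the third condition would change.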