TempCore: Are Video QA Benchmarks Temporally Grounded? A Frame Selection Sensitivity Analysis and Benchmark

Hyunjong Ok, Jaeho Lee

Abstract

Vision-language models (VLMs) can ingest only a limited number of video frames, making frame selection a practical necessity. But do current Video QA benchmarks genuinely require temporal frame selection, or can most questions be answered regardless of which frames are shown? We introduce Frame Selection Sensitivity (FSS), a per-sample diagnostic that measures how much VLM accuracy changes when the most relevant frames are replaced with the least relevant ones. Across six benchmarks and eight VLMs, we find that a large majority of samples are frame-agnostic: only a minority are genuinely sensitive to frame choice. Combining FSS with a Language Independence Score (LIS) reveals that merely 8--33% of samples are Temporally Sensitive. We construct TempCore, compact evaluation subsets that isolate these temporal samples from existing benchmarks, and will release code and per-sample annotations upon publication.
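
As a rough illustration of the FSS idea in the abstract, the sketch below scores one sample as the difference between accuracy with the most relevant frames and accuracy with the least relevant frames, averaged over models. The averaging scheme, the function name `frame_selection_sensitivity`, and the boolean inputs are assumptions made for illustration; the abstract does not give the exact formula.

```python
def frame_selection_sensitivity(correct_max, correct_min):
    """Per-sample FSS sketch (assumed form, not the paper's stated formula).

    correct_max[i]: True if model i answers correctly given the most relevant frames.
    correct_min[i]: True if model i answers correctly given the least relevant frames.
    Returns a value in [-1, 1]; positive means relevant frames help.
    """
    if not correct_max or len(correct_max) != len(correct_min):
        raise ValueError("need one non-empty outcome list per condition, same length")
    n_models = len(correct_max)
    # Booleans sum as 0/1, so each sum counts the models that answered correctly.
    return (sum(correct_max) - sum(correct_min)) / n_models
```

Under this assumed form, a score near zero means the models answer equally well with either frame set, which corresponds to the frame-agnostic case described above.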

Paper Structure

This paper contains 69 sections, 6 equations, 13 figures, 23 tables.

Figures (13)

  • Figure 1: Grad-CAM attention maps for question-text vs. answer-text scoring. Question-text attention is diffuse, while answer-text attention concentrates on the relevant region, illustrating the noise introduced by interrogative cues.
  • Figure 2: MaxProb vs. MinProb accuracy (8-model average). Frame selection yields negligible effect on short-video benchmarks but consistent gains on long-video benchmarks. LVB denotes LongVideoBench.
  • Figure 3: Oracle gaps reveal large untapped headroom for frame selection. Uniform-sampling vs. window-oracle accuracy (model average). Long-video gaps far exceed short-video gaps, confirming that temporal frame selection is most consequential for extended videos. LVB denotes LongVideoBench.
  • Figure 4: FSS scores concentrate near zero, confirming non-temporal dominance. Aggregated FSS distributions for short-video and long-video benchmarks. Dashed lines mark classification thresholds ($\tau = \pm 0.15$). The heavy mass near zero shows that frame selection has negligible effect on the majority of samples. Per-benchmark breakdowns are in \ref{app:fss_detail}.
  • Figure 5: Overview of the TempCore construction pipeline. Stage 1 filters out Trivial samples (solvable from language priors, $\textsc{LIS} \leq \lambda$). Stage 2 classifies the remaining vision-dependent samples along the FSS axis into three categories: Temporally Grounded ($\textsc{FSS} > \tau$), Frame-Agnostic ($|\textsc{FSS}| \leq \tau$), and Visual Bias ($\textsc{FSS} < -\tau$). The Temporal Purity Index (TPI) summarizes the fraction of vision-dependent samples that are temporally sensitive.
  • ...and 8 more figures
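
The two-stage classification in the Figure 5 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' released code: `TAU = 0.15` comes from the Figure 4 caption, while `LAMBDA = 0.5` is a hypothetical placeholder because no value for $\lambda$ appears here. The sketch also assumes that "temporally sensitive" in the TPI definition means the Temporally Grounded category ($\textsc{FSS} > \tau$); the function names are illustrative.

```python
from collections import Counter

TAU = 0.15    # FSS threshold taken from the Figure 4 caption
LAMBDA = 0.5  # hypothetical LIS threshold; the source does not state a value


def classify(fss, lis, tau=TAU, lam=LAMBDA):
    """Assign one sample to a category following the Figure 5 caption."""
    # Stage 1: samples solvable from language priors are filtered out.
    if lis <= lam:
        return "Trivial"
    # Stage 2: split the vision-dependent samples along the FSS axis.
    if fss > tau:
        return "Temporally Grounded"
    if fss < -tau:
        return "Visual Bias"
    return "Frame-Agnostic"


def temporal_purity_index(samples, tau=TAU, lam=LAMBDA):
    """Fraction of vision-dependent samples labeled Temporally Grounded.

    samples: iterable of (fss, lis) pairs.
    Assumption: "temporally sensitive" is read as FSS > tau.
    """
    counts = Counter(classify(fss, lis, tau, lam) for fss, lis in samples)
    vision_dependent = sum(counts.values()) - counts["Trivial"]
    if vision_dependent == 0:
        return 0.0
    return counts["Temporally Grounded"] / vision_dependent
```

For example, with the four `(fss, lis)` pairs `(0.5, 0.9)`, `(0.0, 0.9)`, `(-0.3, 0.9)`, and `(0.5, 0.1)`, the last sample is Trivial, the first three are vision-dependent, and only the first is Temporally Grounded, so the sketch returns a TPI of 1/3.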