HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering

Dan Ben-Ami, Gabriele Serussi, Kobi Cohen, Chaim Baskin

TL;DR

HERBench introduces a high-evidential-demand benchmark for VideoQA that requires cross-time integration of at least three distinct evidential cues. It formalizes the Minimum Required Frame-Set (MRFS) to quantify the amount of visual evidence needed and demonstrates that existing Video-LLMs struggle with both retrieving the right frames and integrating dispersed information. The dataset comprises 26,806 multiple-choice questions across 12 compositional tasks and 336 videos, with rigorous construction pipelines and human verification to prevent shortcuts. Across 13 state-of-the-art models, performance remains limited (mean ~38%), revealing substantial room for improvement in both frame retrieval and multi-evidence fusion. By making cross-time evidence unavoidable and measurable, HERBench provides a principled target for advancing robust, compositional video understanding and diagnostic tools for future Video-LLMs.

Abstract

Video Large Language Models (Video-LLMs) are rapidly improving, yet current Video Question Answering (VideoQA) benchmarks often allow questions to be answered from a single salient cue, under-testing reasoning that must aggregate multiple, temporally separated pieces of visual evidence. We present HERBench, a VideoQA benchmark purpose-built to assess multi-evidence integration across time. Each question requires aggregating at least three non-overlapping evidential cues across distinct video segments, so neither language priors nor a single snapshot can suffice. HERBench comprises 26K five-way multiple-choice questions organized into twelve compositional tasks that probe identity binding, cross-entity relations, temporal ordering, co-occurrence verification, and counting. To make evidential demand measurable, we introduce the Minimum Required Frame-Set (MRFS), the smallest number of frames a model must fuse to answer correctly, and show that HERBench imposes substantially higher demand than prior datasets (mean MRFS 5.5 vs. 2.6-4.2). Evaluating 13 state-of-the-art Video-LLMs on HERBench reveals pervasive failures: accuracies of 31-42% are only slightly above the 20% random-guess baseline. We disentangle this failure into two critical bottlenecks: (1) a retrieval deficit, where frame selectors overlook key evidence, and (2) a fusion deficit, where models fail to integrate information even when all necessary evidence is provided. By making cross-time evidence both unavoidable and quantifiable, HERBench establishes a principled target for advancing robust, compositional video understanding.
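
The MRFS definition above ("the smallest number of frames a model must fuse to answer correctly") can be illustrated with a minimal sketch. This is not the paper's estimation procedure, which the abstract does not describe; it is a brute-force reading of the definition. The function name `min_required_frames`, the `is_correct` callback, and the `max_k` cap are hypothetical, and the exhaustive search over subsets is an assumption that is only practical for small frame pools.

```python
from itertools import combinations
from typing import Callable, Hashable, Optional, Sequence


def min_required_frames(
    frames: Sequence[Hashable],
    is_correct: Callable[[Sequence[Hashable]], bool],
    max_k: Optional[int] = None,
) -> Optional[int]:
    """Return the smallest subset size k for which some k-frame subset
    lets the model answer correctly, or None if no subset up to the cap does.

    `is_correct` is a hypothetical callback: it runs the model on the given
    frames and reports whether the predicted option matches the ground truth.
    """
    n = len(frames)
    limit = n if max_k is None else min(max_k, n)
    # k = 0 covers questions answerable from text alone (language priors).
    for k in range(limit + 1):
        for subset in combinations(frames, k):
            if is_correct(subset):
                return k
    return None


# Toy stand-in for a model: it answers correctly only when all three
# evidence-bearing frames (indices 2, 5, 9) are present in its input.
evidence = {2, 5, 9}
toy_model = lambda subset: evidence <= set(subset)
mrfs = min_required_frames(range(12), toy_model)
```

In this toy setup the search returns 3, matching the three evidence frames. Because the number of subsets grows combinatorially, any practical estimate would need a cheaper search strategy than the one sketched here.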

Paper Structure

This paper contains 70 sections, 3 equations, 21 figures, 5 tables.

Figures (21)

  • Figure 1: From Single-Cue to Multi-Evidence Integration. While existing benchmarks like MVBench [li2024mvbench] (top) often focus on short-term attributes solvable via single salient frames or language priors, HERBench (bottom) enforces a high Evidential Requirement (ER). In this Temporal Shot Ordering example, the model must identify and temporally bind four distinct, non-overlapping pieces of visual evidence dispersed across the video to reconstruct the correct sequence. This design ensures that successful answering requires genuine multi-evidence integration rather than reliance on static shortcuts.
  • Figure 2: Task taxonomy of HERBench. We organize 12 fine-grained compositional tasks into four essential reasoning families: (1) Temporal Reasoning & Chronology, (2) Referring & Tracking, (3) Global Consistency & Verification, and (4) Multi-Entity Aggregation & Numeracy. Unlike existing benchmarks that may allow for single-frame shortcuts, every task in HERBench is constructed to enforce a High Evidential Requirement, requiring models to aggregate at least three distinct, temporally separated visual cues ($k \ge 3$) to derive the correct answer.
  • Figure 3: HERBench Data Construction Pipeline. We employ a tripartite pipeline. (Left) Videos are processed through three parallel streams: 1) Object Tracking and Trajectory Analysis (via RF-DETR and DeepSORT), which produces tracked targets used to generate disentangled Appearance (A) and Behavior (B) cards; 2) Shot Segmentation, which combines shot detection with MLLM-generated scene descriptions; and 3) Ground Truth Integration, which refines human-verified raw event logs. (Middle) The refined data pass through a Manual Review and are then fed into an Oriented Task Programming module that programmatically compiles the 12 compositional tasks. (Right) The pipeline enforces rigorous quality control through expert Manual Review and a Text-Only Filtering stage to eliminate language priors, ensuring all final Multiple Choice Questions (MCQs) enforce multi-evidence integration.
  • Figure 4: Left: Wordcloud of frequent terms in HERBench queries. Center: Distribution of samples across source datasets. Right: Number of questions per task category.
  • Figure 5: Top-1 frame share under oracle-only frames. Violin/box plots show the distribution of the maximum normalized frame-importance share across oracle-only frames for three models (InternVL3.5-14B, Ovis-2.5, Qwen3-VL-8B), split by Correct vs. Incorrect predictions. For each item, we compute leave-one-out deltas of the log-probability of the model’s predicted option and normalize them to per-frame shares; the plotted statistic is the largest share (Top-1). Correct predictions allocate credit more evenly across frames (typically $\sim 0.5$), whereas errors over-concentrate on a single frame (often $\sim 0.8$), indicating insufficient multi-evidence fusion even when only evidence-bearing frames are provided.
  • ...and 16 more figures
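
The Top-1 frame share statistic described in the Figure 5 caption can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the caption does not say how the leave-one-out deltas are normalized, so clipping negative deltas to zero and dividing by their sum is an assumption, as is the uniform fallback when every delta is zero. The names `top1_frame_share` and `logprob_fn` are hypothetical; `logprob_fn` stands in for a model call that returns the log-probability of the predicted option given a list of frames.

```python
from typing import Callable, Hashable, List, Sequence


def top1_frame_share(
    frames: Sequence[Hashable],
    logprob_fn: Callable[[List[Hashable]], float],
) -> float:
    """Largest normalized leave-one-out importance share among the frames."""
    if not frames:
        raise ValueError("at least one frame is required")
    frames = list(frames)
    full = logprob_fn(frames)

    # Leave-one-out delta: how much the log-probability of the predicted
    # option drops when frame i is removed from the input.
    deltas = []
    for i in range(len(frames)):
        reduced = frames[:i] + frames[i + 1:]
        deltas.append(full - logprob_fn(reduced))

    # Assumption: frames whose removal does not hurt get zero importance.
    importance = [max(d, 0.0) for d in deltas]
    total = sum(importance)
    if total == 0.0:
        # Assumption: with no measurable importance, treat shares as uniform.
        return 1.0 / len(frames)
    return max(importance) / total


# Toy scorer: each frame contributes a fixed amount to the log-probability.
weights = {"a": 0.8, "b": 0.1, "c": 0.05, "d": 0.05}
toy_logprob = lambda fs: sum(weights[f] for f in fs) - 5.0
concentrated = top1_frame_share(list(weights), toy_logprob)
```

With this toy scorer, frame "a" accounts for most of the log-probability, so the statistic is close to 0.8, the over-concentrated regime the caption associates with incorrect predictions. A scorer in which two frames contribute equally would give 0.5.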