HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering

Dan Ben-Ami; Gabriele Serussi; Kobi Cohen; Chaim Baskin

TL;DR

HERBench is a high-evidential-demand benchmark for VideoQA in which every question requires cross-time integration of at least three distinct evidential cues. It formalizes the Minimum Required Frame-Set (MRFS) to quantify how much visual evidence a question needs, and it shows that existing Video-LLMs struggle both to retrieve the right frames and to integrate dispersed information. The dataset comprises 26,806 multiple-choice questions across 12 compositional tasks and 336 videos, built with rigorous construction pipelines and human verification to prevent shortcuts. Across 13 state-of-the-art models, accuracy remains limited (mean ~38%), leaving substantial room for improvement in both frame retrieval and multi-evidence fusion. By making cross-time evidence unavoidable and measurable, HERBench provides a principled target for advancing robust, compositional video understanding and a diagnostic tool for future Video-LLMs.

Abstract

Video Large Language Models (Video-LLMs) are rapidly improving, yet current Video Question Answering (VideoQA) benchmarks often allow questions to be answered from a single salient cue, under-testing reasoning that must aggregate multiple, temporally separated pieces of visual evidence. We present HERBench, a VideoQA benchmark purpose-built to assess multi-evidence integration across time. Each question requires aggregating at least three non-overlapping evidential cues across distinct video segments, so neither language priors nor a single snapshot can suffice. HERBench comprises 26K five-way multiple-choice questions organized into twelve compositional tasks that probe identity binding, cross-entity relations, temporal ordering, co-occurrence verification, and counting. To make evidential demand measurable, we introduce the Minimum Required Frame-Set (MRFS), the smallest number of frames a model must fuse to answer correctly, and show that HERBench imposes substantially higher demand than prior datasets (mean MRFS 5.5 vs. 2.6-4.2). Evaluating 13 state-of-the-art Video-LLMs on HERBench reveals pervasive failures: accuracies of 31-42% are only slightly above the 20% random-guess baseline. We disentangle this failure into two critical bottlenecks: (1) a retrieval deficit, where frame selectors overlook key evidence, and (2) a fusion deficit, where models fail to integrate information even when all necessary evidence is provided. By making cross-time evidence both unavoidable and quantifiable, HERBench establishes a principled target for advancing robust, compositional video understanding.
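To make the MRFS definition concrete, the following is a minimal brute-force sketch, not the paper's actual estimation procedure. It assumes a hypothetical oracle `answers_correctly` that reports whether a question can be answered from a given subset of frames; in practice this would involve querying a Video-LLM, which is replaced here by a toy rule over hand-labelled cues. The function name, the oracle, and the toy data are all illustrative assumptions.

```python
from itertools import combinations
from typing import Callable, Optional, Sequence, Set


def minimum_required_frame_set(
    frames: Sequence[Set[str]],
    answers_correctly: Callable[[Sequence[Set[str]]], bool],
) -> Optional[int]:
    """Return the smallest number of frames that suffices, or None.

    Exhaustive search over subsets in order of increasing size, so the
    first size k at which any subset succeeds is the minimum. The cost is
    exponential in the number of frames; this is only for illustration.
    """
    for k in range(1, len(frames) + 1):
        for subset in combinations(frames, k):
            if answers_correctly(subset):
                return k
    return None


def make_cue_oracle(required: Set[str]) -> Callable[[Sequence[Set[str]]], bool]:
    """Hypothetical oracle: correct iff the frames jointly show all cues."""
    def oracle(subset: Sequence[Set[str]]) -> bool:
        seen: Set[str] = set()
        for frame in subset:
            seen |= frame
        return required <= seen
    return oracle


# Toy video: each frame is the set of evidential cues visible in it.
toy_frames = [{"A"}, set(), {"B"}, {"A", "B"}, {"C"}]

# A question needing cues A, B, and C: no single frame shows all three,
# but the frames {"A", "B"} and {"C"} together do.
mrfs = minimum_required_frame_set(toy_frames, make_cue_oracle({"A", "B", "C"}))

# A question needing a cue that never appears cannot be answered.
unreachable = minimum_required_frame_set(toy_frames, make_cue_oracle({"D"}))
```

In this toy setup a question whose cues all appear in one frame would have a value of 1, which is the single-snapshot shortcut the benchmark is designed to rule out.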
