EC-Bench: Enumeration and Counting Benchmark for Ultra-Long Videos

Fumihiko Tsuchiya, Taiki Miyanishi, Mahiro Ukai, Nakamasa Inoue, Shuhei Kurita, Yusuke Iwasawa, Yutaka Matsuo

Abstract

Counting in long videos remains a fundamental yet underexplored challenge in computer vision. Real-world recordings often span tens of minutes or longer and contain sparse, diverse events, making long-range temporal reasoning particularly difficult. However, most existing video counting benchmarks focus on short clips and evaluate only the final numerical answer, providing little insight into what should be counted or whether models consistently identify relevant instances across time. We introduce EC-Bench, a benchmark that jointly evaluates enumeration, counting, and temporal evidence grounding in long-form videos. EC-Bench contains 152 videos longer than 30 minutes and 1,699 queries paired with explicit evidence spans. Across 22 multimodal large language models (MLLMs), the best model achieves only 29.98% accuracy on Enumeration and 23.74% on Counting, while human performance reaches 78.57% and 82.97%, respectively. Our analysis reveals strong relationships between enumeration accuracy, temporal grounding, and counting performance. These results highlight fundamental limitations of current MLLMs and establish EC-Bench as a challenging benchmark for long-form quantitative video reasoning.

Paper Structure

This paper contains 29 sections, 15 figures, and 4 tables.

Figures (15)

  • Figure 1: Example tasks from EC-Bench. The benchmark evaluates two abilities for long-form video understanding: enumeration, which identifies all relevant instances (e.g., listing brands or flavors), and counting, which aggregates occurrences of instances across long temporal contexts. Questions are grounded in sparse evidence spans within videos longer than 30 minutes.
  • Figure 2: Reasoning categories in EC-Bench. Examples of six quantitative reasoning types evaluated in the benchmark.
  • Figure 3: Task statistics of EC-Bench. Top: reasoning category distribution for Enumeration (left) and Counting (right). Bottom: answer distributions for Enumeration (left) and Counting (right).
  • Figure 4: Video statistics of EC-Bench. Left: distribution of video durations (all videos exceed 30 minutes; median 47 minutes). Right: representative examples of video genres in the dataset.
  • Figure 5: Video length comparison with existing datasets.
  • ...and 10 more figures