EventBench: Towards Comprehensive Benchmarking of Event-based MLLMs
Shaoyu Liu, Jianing Li, Guanghui Zhao, Yunjian Zhang, Xiangyang Ji
TL;DR
EventBench is a fully open, multi-dimensional benchmark for event-based MLLMs, combining eight task metrics with a large-scale event-stream dataset (112,000 samples) and a training corpus of over 1,000,000 event-text pairs (EQA-1.4M) to enable fair, scalable evaluation. Its eight tasks span understanding, recognition, and spatial reasoning, are sourced from more than 20 real-world and synthetic datasets, and are evaluated through an open pipeline. The study shows that event-based MLLMs excel at general event understanding but still lag in fine-grained recognition and 3D spatial reasoning, with dynamic event binning (EventGPT+) improving long-horizon processing. Overall, EventBench provides a standardized platform to benchmark, compare, and drive advances in event-based multimodal learning, informing future work on modality adaptation and scalable event processing.
Abstract
Multimodal large language models (MLLMs) have made significant advancements in event-based vision, yet the comprehensive evaluation of their capabilities within a unified benchmark remains largely unexplored. In this work, we introduce EventBench, a benchmark that offers eight diverse task metrics together with a large-scale event stream dataset. EventBench differs from existing event-based benchmarks in four key aspects: (1) openness in accessibility, releasing all raw event streams and task instructions across eight evaluation metrics; (2) diversity in task coverage, spanning understanding, recognition, and spatial reasoning tasks for comprehensive capability assessment; (3) integration in spatial dimensions, pioneering the design of 3D spatial reasoning tasks for event-based MLLMs; and (4) scale in data volume, with an accompanying training set of over one million event-text pairs supporting large-scale training and evaluation. Using EventBench, we evaluate state-of-the-art closed-source models such as GPT-5 and Gemini-2.5 Pro, leading open-source models including Qwen2.5-VL and InternVL3, and event-based MLLMs such as EventGPT that directly process raw event streams. Extensive evaluation reveals that while current event-based MLLMs demonstrate strong performance in event stream understanding, they continue to struggle with fine-grained recognition and spatial reasoning.
