EventBench: Towards Comprehensive Benchmarking of Event-based MLLMs

Shaoyu Liu, Jianing Li, Guanghui Zhao, Yunjian Zhang, Xiangyang Ji

TL;DR

EventBench introduces a fully open, multi-dimensional benchmark for event-based MLLMs, combining 8 task metrics with a large-scale event-stream dataset (112,000 samples) and a training corpus of over 1,000,000 event-text pairs (EQA-1.4M) to enable fair, scalable evaluation. It spans eight tasks across understanding, recognition, and spatial reasoning, sourced from more than 20 real-world and synthetic datasets and evaluated through an open pipeline. The study shows event-based MLLMs excel at general event understanding but still lag in fine-grained recognition and 3D spatial reasoning, with dynamic event binning (EventGPT+) improving long-horizon processing. Overall, EventBench provides a standardized platform to benchmark, compare, and drive advances in event-based multimodal learning, informing future work on modality adaptation and scalable event processing.

Abstract

Multimodal large language models (MLLMs) have made significant advancements in event-based vision, yet the comprehensive evaluation of their capabilities within a unified benchmark remains largely unexplored. In this work, we introduce EventBench, a benchmark that offers eight diverse task metrics together with a large-scale event stream dataset. EventBench differs from existing event-based benchmarks in four key aspects: (1) openness in accessibility, releasing all raw event streams and task instructions across eight evaluation metrics; (2) diversity in task coverage, spanning understanding, recognition, and spatial reasoning tasks for comprehensive capability assessment; (3) integration in spatial dimensions, pioneering the design of 3D spatial reasoning tasks for event-based MLLMs; and (4) scale in data volume, with an accompanying training set of over one million event-text pairs supporting large-scale training and evaluation. Using EventBench, we evaluate state-of-the-art closed-source models such as GPT-5 and Gemini-2.5 Pro, leading open-source models including Qwen2.5-VL and InternVL3, and event-based MLLMs such as EventGPT that directly process raw event streams. Extensive evaluation reveals that while current event-based MLLMs demonstrate strong performance in event stream understanding, they continue to struggle with fine-grained recognition and spatial reasoning.

Paper Structure

This paper contains 18 sections, 4 equations, 9 figures, 8 tables, 1 algorithm.

Figures (9)

  • Figure 1: Our EventBench is a publicly accessible and comprehensive evaluation benchmark for event-based MLLMs. It offers diverse task metrics across multiple dimensions (i.e., understanding, recognition, and spatial reasoning) and will accelerate research on event-based MLLMs in challenging scenarios.
  • Figure 2: The comprehensive EventBench covers 8 diverse task metrics for systematically evaluating the capabilities of event-based MLLMs. These metrics can be broadly categorized into three types: understanding (i.e., detailed understanding), recognition (i.e., action recognition, gesture recognition, and event OCR), and spatial reasoning (i.e., spatial relationship, absolute distance, and object counting).
  • Figure 3: Data statistics of our EventBench. (a) Task and category distribution across three groups: understanding (i.e., DU and CR), recognition (i.e., AR, GR, and E-OCR), and spatial reasoning (i.e., SR, AD, and OC). (b) Sample statistics for each task category. (c) Comparison with existing event-based benchmarks across multiple dimensions (i.e., modality, metric type, data source, size, and temporal span). Note that EventBench provides a comprehensive benchmark for systematically evaluating the capabilities of event-based MLLMs.
  • Figure 4: Data statistics in our EventBench across multiple dimensions. (a) Distribution of synthetic and real-world datasets. (b) Proportions of event streams by temporal length. (c)-(d) Distributions of instruction types and multiple-choice options.
  • Figure 5: Overall performance improvement on EventBench after supervised fine-tuning (SFT). (a) Model-wise improvement, with a maximum gain of 25.2%. (b) Task-wise improvement of Qwen2.5-VL across spatial reasoning, downstream, and open-domain QA tasks.
  • ...and 4 more figures