AccidentBench: Benchmarking Multimodal Understanding and Reasoning in Vehicle Accidents and Beyond

Shangding Gu, Xiaohan Wang, Donghao Ying, Haoyu Zhao, Runing Yang, Ming Jin, Boyi Li, Marco Pavone, Serena Yeung-Levy, Jun Wang, Dawn Song, Costas Spanos

TL;DR

AccidentBench addresses the lack of unified, safety-critical multimodal evaluation across land, air, and maritime domains by assembling ~2000 videos and >19,000 QA pairs to probe temporal, spatial, and intent reasoning under real-world conditions. The benchmark offers two task formats (interval-based and accuracy-based) and diverse scenario settings (vehicle accidents, airspace, ship motion), enabling rigorous evaluation of state-of-the-art models. Key findings show even top models struggle on hard, long-horizon tasks, underscoring gaps in safety-critical temporal-spatial-intent reasoning and motivating targeted improvements. By providing a large-scale, physically grounded testbed and accompanying code, AccidentBench aims to drive the development of safer, more robust multimodal systems for real-world safety-critical applications.

Abstract

Rapid advances in multimodal models demand benchmarks that rigorously evaluate understanding and reasoning in safety-critical, dynamic real-world settings. We present AccidentBench, a large-scale benchmark that combines vehicle accident scenarios with "Beyond" domains, i.e., safety-critical settings in air and water that emphasize spatial and temporal reasoning (e.g., navigation, orientation, multi-vehicle motion). The benchmark contains approximately 2,000 videos and over 19,000 human-annotated question-answer pairs spanning multiple video lengths (short/medium/long) and difficulty levels (easy/medium/hard). Tasks systematically probe core capabilities: temporal, spatial, and intent understanding and reasoning. By unifying accident-centric traffic scenes with broader safety-critical scenarios in air and water, AccidentBench offers a comprehensive, physically grounded testbed for evaluating models under real-world variability. Evaluations of state-of-the-art models (e.g., Gemini-2.5 Pro and GPT-5) show that even the strongest models achieve only about 18% accuracy on the hardest tasks and longest videos, revealing substantial gaps in real-world temporal, spatial, and intent reasoning. AccidentBench is designed to expose these critical gaps and drive the development of multimodal models that are safer, more robust, and better aligned with real-world safety-critical challenges. The code and dataset are available at: https://github.com/SafeRL-Lab/AccidentBench

Paper Structure

This paper contains 27 sections, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Examples of multimodal understanding and reasoning in vehicle accident and other safety-critical scenarios.
  • Figure 2: Land-space traffic accident scenarios for open-space video understanding and reasoning include intersection collisions, urban road accidents, nighttime incidents, rural road accidents, snow-covered road collisions, and freeway accidents.
  • Figure 3: Examples of question settings in AccidentBench across three key understanding and reasoning types: Temporal Understanding and Reasoning, which involves understanding event sequences and motion over time; Spatial Understanding and Reasoning, which focuses on relative positioning and orientation in space; and Intent Understanding and Reasoning, which evaluates understanding of goal-directed behaviors and decision-making in dynamic environments.
  • Figure 4: Qualitative error analysis of SOTA multimodal models (Gemini 2.5 and GPT-4o) on the AccidentBench benchmark. Each example illustrates a failure case in a different reasoning category: spatial reasoning (left), temporal reasoning (middle), and intent reasoning (right). Despite their capabilities, both models struggle with spatial localization, counting dynamic objects, and understanding goal-directed motion in real-world safety-critical scenarios.
  • Figure 5: A question and answer example: For each scenario setting, we include three video lengths: short, medium, and long. Each video length includes tasks designed to evaluate temporal reasoning, spatial reasoning, and intent reasoning.
  • ...and 2 more figures