Abductive Ego-View Accident Video Understanding for Safe Driving Perception

Jianwu Fang, Lei-lei Li, Junfei Zhou, Junbin Xiao, Hongkai Yu, Chen Lv, Jianru Xue, Tat-Seng Chua

TL;DR

This work presents an Abductive accident Video understanding framework for Safe Driving perception (AdVersa-SD), and extensive experiments verify the abductive ability of AdVersa-SD and the superiority of OAVD against the state-of-the-art diffusion models.

Abstract

We present MM-AU, a novel dataset for Multi-Modal Accident video Understanding. MM-AU contains 11,727 in-the-wild ego-view accident videos, each with temporally aligned text descriptions. We annotate over 2.23 million object boxes and 58,650 pairs of video-based accident reasons, covering 58 accident categories. MM-AU supports various accident understanding tasks, particularly multimodal video diffusion to understand accident cause-effect chains for safe driving. With MM-AU, we present an Abductive accident Video understanding framework for Safe Driving perception (AdVersa-SD). AdVersa-SD performs video diffusion via an Object-Centric Video Diffusion (OAVD) method, which is driven by an abductive CLIP model. This model uses a contrastive interaction loss to learn the pair co-occurrence of normal, near-accident, and accident frames with the corresponding text descriptions, such as accident reasons, prevention advice, and accident categories. OAVD enforces causal region learning while fixing the content of the original frame background in video generation, to find the dominant cause-effect chain for certain accidents. Extensive experiments verify the abductive ability of AdVersa-SD and the superiority of OAVD against the state-of-the-art diffusion models. Additionally, we provide careful benchmark evaluations for object detection and accident reason answering, since AdVersa-SD relies on precise object and accident reason information.
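
As a rough illustration of what a contrastive loss over frame-text pairs can look like, the sketch below implements a generic CLIP-style symmetric InfoNCE loss in NumPy. It is not the paper's contrastive interaction loss, whose exact form is not given in this abstract. The function name `symmetric_contrastive_loss`, the `temperature` value, and the assumption that row i of the frame embeddings is paired with row i of the text embeddings are all hypothetical choices made for the example.

```python
import numpy as np


def symmetric_contrastive_loss(frame_emb, text_emb, temperature=0.07):
    """Generic CLIP-style symmetric contrastive loss (illustrative only).

    frame_emb, text_emb: arrays of shape (N, D); row i of each is assumed
    to be a matched frame/text pair. This is NOT the paper's loss.
    """
    # L2-normalize so that dot products are cosine similarities.
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity logits, shape (N, N).
    logits = f @ t.T / temperature

    def cross_entropy_on_diagonal(lg):
        # Numerically stable log-softmax over each row.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i (the matched pair).
        return -np.mean(np.diag(log_probs))

    # Average the frame-to-text and text-to-frame directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))
```

Under these assumptions, correctly paired embeddings give a lower loss than mismatched ones, which is the property a co-occurrence objective of this kind relies on.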

Paper Structure (16 sections, 4 equations, 14 figures, 8 tables)

This paper contains 16 sections, 4 equations, 14 figures, 8 tables.

Figures (14)

  • Figure 1: The ego-view multimodality accident video understanding tasks that MM-AU can support, where we highlight the text descriptions for accident reason ($t_r$), prevention advice ($t_p$), and accident category ($t_a$), as well as temporal windows (accident-free, near-accident, and accident windows) for different tasks.
  • Figure 2: Some samples of object annotation in MM-AU.
  • Figure 2: The object detection snapshots in accident frames by CenterNet [law2018cornernet], DETR [detr2020], DiffusionDet [chen2023diffusiondet], and YOLOv5s [glenn_jocher_2022_7002879]. All detectors fail to detect the cyclist (column (2)) and the pedestrian with distorted posture (column (1)). DETR covers more of the possible objects but generates many false detections.
  • Figure 3: The annotation attribute statistics in MM-AU for the temporal, object, and text annotations. Better viewed in zoomed-in mode.
  • Figure 3: The case visualization of Accident reason Answering (ArA) by 8 state-of-the-art Video Question Answering (VQA) methods.
  • ...and 9 more figures