Real-time Accident Anticipation for Autonomous Driving Through Monocular Depth-Enhanced 3D Modeling

Haicheng Liao, Yongkang Li, Chengyue Wang, Songning Lai, Zhenning Li, Zilin Bian, Jaeyoung Lee, Zhiyong Cui, Guohui Zhang, Chengzhong Xu

TL;DR

AccNet addresses real-time accident anticipation in autonomous driving by exploiting monocular depth cues to build 3D representations from dashcam videos. It introduces a 3D Collision Module with a depth-informed GNN topology and a Binary Adaptive Loss for Early Anticipation (BA-LEA), coupled with a Smooth Module and multitask learning to improve early, accurate predictions. Across DAD, CCD, A3D, and DADA-2000, AccNet achieves superior AP and mean Time-To-Accident (mTTA) compared to SOTA baselines, indicating strong practical benefits for ADAS and autonomous driving safety. The work also demonstrates robust performance through extensive ablations and provides insights into depth-enabled 3D scene understanding for crash anticipation.

Abstract

The primary goal of traffic accident anticipation is to foresee potential accidents in real time using dashcam videos, a task that is pivotal for enhancing the safety and reliability of autonomous driving technologies. In this study, we introduce an innovative framework, AccNet, which significantly advances the prediction capabilities beyond the current state-of-the-art (SOTA) 2D-based methods by incorporating monocular depth cues for sophisticated 3D scene modeling. Addressing the prevalent challenge of skewed data distribution in traffic accident datasets, we propose the Binary Adaptive Loss for Early Anticipation (BA-LEA). This novel loss function, together with a multi-task learning strategy, shifts the focus of the predictive model towards the critical moments preceding an accident. We rigorously evaluate the performance of our framework on four benchmark datasets: the Dashcam Accident Dataset (DAD), the Car Crash Dataset (CCD), AnAn Accident Detection (A3D), and DADA-2000. The results demonstrate its superior predictive accuracy on key metrics such as Average Precision (AP) and mean Time-To-Accident (mTTA).

Paper Structure (24 sections, 11 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Illustration of Monocular Depth-Enhanced 3D Modeling in AccNet. Current methods (a) all rely on 2D bounding boxes identified by object detectors and typically resort to pixel-by-pixel distance computations to approximate motion interactions in driving scenarios, leading to inaccuracies due to the lack of depth information. In contrast, our AccNet model (b) extracts precise 3D coordinates of key traffic participants, including vehicles and pedestrians, from video and leverages monocular depth to calculate real-world distances. This method enhances accident detection by capturing detailed fine-grained correlations between visual images over time.
  • Figure 2: Overall architecture of the proposed AccNet. The feature extractor, object detector, and depth enhancer first generate the visual features, bounding boxes, and depth matrices respectively for the raw T-frame video sequence. These outputs are then fed into the proposed Context Attention, Object Attention, and 3D Collision modules to update the representation of object queries. The Temporal Attention mechanism then fuses the features from the three modalities. Finally, the Smooth Collision Module is introduced to balance multi-task learning, while the Accident Module predicts the likelihood of an accident occurring at each time step.
  • Figure 3: Comparison between the 3D Collision and the 2D Collision Modules. Each point shows the model's performance in the two-dimensional mTTA-AP space, evaluated every half epoch during training with either the 2D Spatial GCN or the 3D Spatial GCN. At the top and right of the mTTA-AP plot are the marginal probability distributions of mTTA and AP, respectively. The figure highlights the results with the highest AP and with a balance between mTTA and AP for the two types of Spatial GCN.
  • Figure 4: Visualization of AccNet's performance in dense urban traffic (a-b) and in a low-light, rainy night scene (c), with the threshold uniformly set at 0.5. Scenes (a-b) are successful accident anticipations, while scene (c) is a failure case in which the model produces a false-positive anticipation.
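
The contrast described in the Figure 1 caption, between pixel-space distances over 2D bounding boxes and real-world distances computed with depth, can be illustrated with a small sketch. This is not the authors' implementation. It assumes a pinhole camera with hypothetical intrinsics (`fx`, `fy`, `cx`, `cy`), a metric depth value sampled at each box center, and the helper names `backproject`, `box_center`, and `pairwise_distances`, all of which are introduced here for illustration only.

```python
import math
from itertools import combinations


def box_center(box):
    """Return the (u, v) pixel center of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at the given depth to camera-frame (X, Y, Z)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def pairwise_distances(boxes, depths, fx, fy, cx, cy):
    """Map each object pair (i, j) to (pixel distance, 3D distance)."""
    centers = [box_center(b) for b in boxes]
    points = [
        backproject(u, v, d, fx, fy, cx, cy)
        for (u, v), d in zip(centers, depths)
    ]
    result = {}
    for i, j in combinations(range(len(boxes)), 2):
        pixel_dist = math.dist(centers[i], centers[j])
        metric_dist = math.dist(points[i], points[j])
        result[(i, j)] = (pixel_dist, metric_dist)
    return result


# Hypothetical scene: three detected objects in a 1280x720 frame.
boxes = [
    (600, 340, 680, 380),  # object 0, near the image center
    (700, 340, 780, 380),  # object 1, close to object 0 in the image
    (900, 340, 980, 380),  # object 2, far from object 0 in the image
]
depths = [10.0, 40.0, 10.0]  # assumed metric depth (meters) at each box center

distances = pairwise_distances(boxes, depths, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
for pair, (pixel_dist, metric_dist) in sorted(distances.items()):
    print(pair, round(pixel_dist, 1), round(metric_dist, 2))
```

In this made-up scene, object 1 is nearer to object 0 than object 2 is when measured in pixels, yet object 2 is much nearer in 3D because object 1 sits 30 m further from the camera. Monocular depth estimators often output relative rather than metric depth, so a real pipeline would need a scale assumption or calibration before distances like these carry physical units.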