CRASH: Crash Recognition and Anticipation System Harnessing with Context-Aware and Temporal Focus Attentions

Haicheng Liao, Haoyu Sun, Huanming Shen, Chengyue Wang, Kahou Tam, Chunlin Tian, Li Li, Chengzhong Xu, Zhenning Li

TL;DR

The paper tackles accident anticipation for autonomous driving by fusing local object interactions with global scene context. It introduces five components: an object detector, a feature extractor, Object Focus Attention (OFA), a context-aware module with FFT and Context-aware Attention Blocks (CAB), and Temporal Focus Attention (TFA) for multi-layer fusion. Together they aim to predict accidents as early as possible. Key contributions include dual-path context modeling in both the spatial and spectral domains, a multi-task loss with uncertainty balancing, and robustness evaluations on augmented datasets with missing data, where the model reports higher AP and mTTA than SOTA baselines. The framework advances practical accident prediction by combining broad scene cues with dynamic frame-level attention, improving timeliness and reliability for real-world autonomous driving systems.
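The "multi-task loss with uncertainty balancing" mentioned above can be illustrated with a small sketch. This summary does not give the exact formulation used in CRASH, so the code below assumes one common form, homoscedastic uncertainty weighting, in which each task loss is scaled by a learned log-variance. The function name `uncertainty_weighted_loss` and its parameters are hypothetical.

```python
import math


def uncertainty_weighted_loss(task_losses, log_variances):
    """Combine per-task losses using learned log-variances.

    ASSUMPTION: this is the generic homoscedastic-uncertainty form
    sum_i exp(-s_i) * L_i + s_i, with s_i = log(sigma_i^2). It is not
    confirmed to be the loss used by CRASH.
    """
    if len(task_losses) != len(log_variances):
        raise ValueError("need one log-variance per task loss")
    total = 0.0
    for loss, s in zip(task_losses, log_variances):
        # exp(-s) down-weights tasks with high uncertainty; the "+ s" term
        # penalizes simply inflating every uncertainty to shrink the loss.
        total += math.exp(-s) * loss + s
    return total
```

With all log-variances at zero the result is the plain sum of the task losses. In training, the log-variances would be learnable parameters optimized jointly with the network weights.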

Abstract

Accurately and promptly predicting accidents among surrounding traffic agents from camera footage is crucial for the safety of autonomous vehicles (AVs). This task presents substantial challenges stemming from the unpredictable nature of traffic accidents, their long-tail distribution, the intricacies of traffic scene dynamics, and the inherently constrained field of vision of onboard cameras. To address these challenges, this study introduces a novel accident anticipation framework for AVs, termed CRASH. It integrates five components: object detector, feature extractor, object-aware module, context-aware module, and multi-layer fusion. Specifically, we develop the object-aware module to prioritize high-risk objects in complex and ambiguous environments by calculating the spatial-temporal relationships between traffic agents. In parallel, the context-aware module is devised to extend global visual information from the temporal to the frequency domain using the Fast Fourier Transform (FFT) and to capture fine-grained visual features of potential objects and broader context cues within traffic scenes. To capture a wider range of visual cues, we further propose a multi-layer fusion that dynamically computes the temporal dependencies between different scenes and iteratively updates the correlations between different visual features for accurate and timely accident prediction. Evaluated on three real-world datasets, the Dashcam Accident Dataset (DAD), Car Crash Dataset (CCD), and AnAn Accident Detection (A3D), our model surpasses existing top baselines in critical evaluation metrics such as Average Precision (AP) and mean Time-To-Accident (mTTA). Importantly, its robustness and adaptability are particularly evident in challenging driving scenarios with missing or limited training data, demonstrating significant potential for application in real-world autonomous driving systems.

Paper Structure

This paper contains 14 sections, 13 equations, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: Overall framework of CRASH (a) and the architecture of Temporal Focus Attention (b).
  • Figure 2: Attention weights of hidden states over all TFA blocks in 8 TFA layers.
  • Figure 3: Qualitative results of CRASH in rainy weather (a), low nighttime lighting (b), heavy fog (c), and dense multi-agent traffic scenes (d) on the DAD dataset. The orange bar graph represents the loss of video data for that frame.