Physical Depth-aware Early Accident Anticipation: A Multi-dimensional Visual Feature Fusion Framework

Hongpu Huang, Wei Zhou, Chen Wang

TL;DR

The paper tackles early accident anticipation from dashcam video with a physical depth-aware framework that adds monocular depth features from Depth-Anything to a fusion of depth, interaction, and dynamic cues. It builds a four-module pipeline (Visual Depth Feature Extraction, Visual Interaction Feature Extraction, Visual Dynamic Feature Extraction, and Spatio-Temporal Feature Extraction) that fuses the per-frame depth, interaction, and dynamic features into a frame graph, which a Graph Attention Network processes to predict accident probability. A key innovation is an occlusion-aware reconstruction adjacency matrix that preserves spatio-temporal continuity when key traffic participants are occluded. The framework reports state-of-the-art AP and strong AUC on the DAD dataset, with robust results on CCD and A3D as well. It reduces false positives caused by perspective distortion and improves warning lead time, showing the practical value of depth-aware multi-dimensional feature fusion for intelligent vehicle safety systems.

Abstract

Early accident anticipation from dashcam videos is a highly desirable yet challenging task for improving the safety of intelligent vehicles. Existing advanced accident anticipation approaches commonly model the interaction among traffic agents (e.g., vehicles and pedestrians) in the coarse 2D image space, which may not adequately capture their true positions and interactions. To address this limitation, we propose a physical depth-aware learning framework that incorporates the monocular depth features generated by a large model named Depth-Anything to introduce more fine-grained spatial 3D information. Furthermore, the proposed framework integrates visual interaction features and visual dynamic features from traffic scenes to provide a more comprehensive perception of the scenes. Based on these multi-dimensional visual features, the framework captures early indicators of accidents by analyzing the interaction relationships between objects in sequential frames. Additionally, the proposed framework introduces a reconstruction adjacency matrix for key traffic participants that are occluded, mitigating the impact of occluded objects on graph learning and maintaining spatio-temporal continuity. Experimental results on public datasets show that the proposed framework attains state-of-the-art performance, highlighting the effectiveness of incorporating visual depth features and the superiority of the proposed framework.

Paper Structure

This paper contains 28 sections, 17 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: (a) A key frame depicting a traffic accident, where the red vehicles are involved in the accident, while the green vehicles overlap in position but are not involved in the accident; (b) the visualized depth map of the key frame; (c) the vehicles involved in the accident intersect in the depth dimension; (d) the vehicles not involved in the accident overlap in position but do not intersect in the depth dimension.
  • Figure 2: The overview of the proposed framework. Three types of visual features are obtained through pre-trained encoders. Visual depth feature represents the depth characteristics in each video frame, visual interaction feature captures the interaction relationships between traffic agents, and visual dynamic feature denotes the spatiotemporal dynamic changes in the video sequence. These visual features are then fused and passed to the graph attention layer and FC layers to generate the final prediction probability. The lock symbol indicates that the module is frozen during training.
  • Figure 3: Public dashcam video datasets: (a) DAD, (b) CCD, and (c) A3D.
  • Figure 4: The visualization results of the proposed physical depth-aware framework, GSC, and UString on the same positive video sequence.
  • Figure 5: The visualization results of the proposed physical depth-aware framework and GSC on the same negative sample video sequence, where (a) shows the proposed framework identifying the video as a True Negative and (b) shows GSC identifying the video as a False Positive.
  • ...and 2 more figures
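
The Figure 2 caption describes fusing three visual features and passing them through a graph attention layer and FC layers to obtain a prediction probability. The following is a minimal NumPy sketch of that data flow, not the authors' implementation. Everything concrete in it is an assumption made for illustration: fusion by concatenation, the feature sizes, a single attention head, mean pooling, random weights, and all function and variable names. The text here does not settle whether graph nodes stand for objects or frames, so the sketch is agnostic on that point, and the occlusion-aware reconstruction adjacency is not modeled because its construction is not given in this section.

```python
import numpy as np

def fuse_features(depth, interaction, dynamic):
    # Assumed fusion operator: concatenate the three features per node.
    return np.concatenate([depth, interaction, dynamic], axis=-1)

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def graph_attention(h, adj, W, a):
    """Single-head GAT-style layer.
    h: (N, F) node features; adj: (N, N) 0/1 adjacency with self-loops;
    W: (F, F_out) projection; a: (2 * F_out,) attention vector."""
    z = h @ W
    f_out = z.shape[1]
    # a^T [z_i || z_j] splits into a source term plus a destination term.
    scores = leaky_relu((z @ a[:f_out])[:, None] + (z @ a[f_out:])[None, :])
    scores = np.where(adj > 0, scores, -np.inf)   # attend only along edges
    scores = scores - scores.max(axis=1, keepdims=True)
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ z, alpha

def accident_probability(h, adj, W, a, w_fc, b_fc):
    out, alpha = graph_attention(h, adj, W, a)
    pooled = np.maximum(out, 0.0).mean(axis=0)    # ReLU, then mean over nodes
    logit = pooled @ w_fc + b_fc                  # stand-in for the FC layers
    return 1.0 / (1.0 + np.exp(-logit)), alpha

# Toy input: 4 nodes, 8-dimensional features of each type (hypothetical sizes).
rng = np.random.default_rng(0)
n_nodes, f_each, f_out = 4, 8, 16
h = fuse_features(rng.normal(size=(n_nodes, f_each)),
                  rng.normal(size=(n_nodes, f_each)),
                  rng.normal(size=(n_nodes, f_each)))
adj = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1]])
W = rng.normal(size=(h.shape[1], f_out)) * 0.1
a = rng.normal(size=2 * f_out) * 0.1
w_fc = rng.normal(size=f_out) * 0.1
prob, alpha = accident_probability(h, adj, W, a, w_fc, 0.0)
print(h.shape, float(prob))
```

In this sketch the adjacency matrix decides which nodes may attend to each other, so an occlusion-handling scheme such as the one the abstract mentions would act by changing `adj` before the attention step.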