How good are deep learning methods for automated road safety analysis using video data? An experimental study

Qingwu Liu, Nicolas Saunier, Guillaume-Alexandre Bilodeau

TL;DR

This study evaluates three deep-learning MOT methods (MonoDTR, YoloStereo3D, CenterTrack) on the KITTI dataset to compute road-safety indicators, notably Time-to-Collision ($TTC$). A unified framework projects detections into the bird's-eye view (BEV) and derives $TTC$ from road-user trajectories, augmented by two post-processing steps, IDsplit and SS, to test sensitivity to tracking errors. Results show that all methods over-estimate the number of interactions and under-estimate $TTC$, making interactions appear more dangerous than in the ground truth, with CenterTrack offering the best overall tracking but not consistently improving $TTC$ accuracy. The findings suggest that current deep-learning tracking approaches, even when combined with stereo information, are insufficient for reliable $TTC$-based road-safety analysis, highlighting the need for more data (including from roadside sensors) and additional indicators such as PET to close the gap between automated analysis and actual safety risks.

Abstract

Image-based multi-object detection (MOD) and multi-object tracking (MOT) are advancing at a fast pace. A variety of 2D and 3D MOD and MOT methods have been developed for monocular and stereo cameras. Road safety analysis can benefit from those advancements. As crashes are rare events, surrogate measures of safety (SMoS) have been developed for safety analyses. (Semi-)Automated safety analysis methods extract road user trajectories to compute safety indicators, for example, Time-to-Collision (TTC) and Post-encroachment Time (PET). Inspired by the success of deep learning in MOD and MOT, we investigate three MOT methods, including one based on a stereo-camera, using the annotated KITTI traffic video dataset. Two post-processing steps, IDsplit and SS, are developed to improve the tracking results and investigate the factors influencing the TTC. The experimental results show that, despite some advantages in terms of the numbers of interactions or similarity to the TTC distributions, all the tested methods systematically over-estimate the number of interactions and under-estimate the TTC: they report more interactions and more severe interactions, making the road user interactions appear less safe than they are. Further efforts will be directed towards testing more methods and more data, in particular from roadside sensors, to verify the results and improve the performance.
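
To make the central indicator concrete, here is a minimal Python sketch of how a TTC value could be computed for one pair of road users at one instant. It is not the paper's formulation, which this section does not give. It assumes constant-velocity motion in the ground plane and models each road user as a point, with a hypothetical `collision_distance` threshold standing in for their physical extent. The function name `time_to_collision` and all of its parameters are illustrative.

```python
import math
from typing import Optional, Tuple

Vec = Tuple[float, float]


def time_to_collision(
    p1: Vec, v1: Vec, p2: Vec, v2: Vec, collision_distance: float = 0.0
) -> Optional[float]:
    """Return the TTC in seconds, or None if no collision is predicted.

    Assumption (not from the paper): both road users keep their current
    velocity, and a collision occurs when their distance drops to
    `collision_distance` (metres).
    """
    # Relative position and velocity of user 2 with respect to user 1.
    px, py = p2[0] - p1[0], p2[1] - p1[1]
    vx, vy = v2[0] - v1[0], v2[1] - v1[1]

    # Solve |p + v t|^2 = d^2, i.e. a t^2 + b t + c = 0.
    a = vx * vx + vy * vy
    b = 2.0 * (px * vx + py * vy)
    c = px * px + py * py - collision_distance ** 2

    if c <= 0.0:
        return 0.0  # already within the collision distance
    if a == 0.0:
        return None  # no relative motion, the distance never changes
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None  # the paths never come close enough
    t = (-b - math.sqrt(disc)) / (2.0 * a)  # earliest contact time
    return t if t >= 0.0 else None  # negative root: users are moving apart


# Two users 100 m apart, approaching head-on at 10 m/s each.
ttc = time_to_collision((0.0, 0.0), (10.0, 0.0), (100.0, 0.0), (-10.0, 0.0))
```

Under these assumptions, evaluating the function at every frame in which two road users coexist and keeping the smallest value gives a per-interaction minimum, comparable in spirit to the $TTC_{min}$ that appears in the figure captions. Noisy position or velocity estimates from a tracker feed directly into `b` and `c`, which is one way tracking errors can shift the estimated TTC.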

Paper Structure

This paper contains 21 sections, 5 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: The safety pyramid, adapted from laureshyn2016review. F refers to the fatal crashes, I to injury crashes and PD to property damage only crashes.
  • Figure 2: Processing pipeline for deep learning-based road safety analysis. GT represents ground truth, CNN represents convolutional neural network girshick2014rich and ViT represents vision transformer dosovitskiy2020image. Examples of two object detection and tracking methods, which perform both steps either separately or jointly, are highlighted by the red and blue ellipses, respectively. Details of the trajectory post-processing steps are illustrated in Figure 3.
  • Figure 3: Post-processing steps for road safety analysis.
  • Figure 4: Boxplots of the D-statistics for each method for all the sequences.
  • Figure 5: Boxplots of the absolute differences of the $TTC_{min}$ medians between each method and the ground truth for all the sequences.
  • ...and 5 more figures