CrashFormer: A Multimodal Architecture to Predict the Risk of Crash

Amin Karimi Monsefi; Pouya Shiri; Ahmad Mohammadshirazi; Nastaran Karimi Monsefi; Ron Davies; Sobhan Moosavi; Rajiv Ramnath

TL;DR

CrashFormer addresses fine-grained accident risk prediction by integrating multi-source data (historical accidents, weather events, map imagery, and demographics) within hexagonal regions of about $5.161$ square kilometers at a $6$-hour cadence. It uses a multi-branch architecture: a FEDFormer-based sequential encoder for time-series data, a VAN-based image encoder for map imagery, and a demographic encoder, whose outputs are fused and passed to a classifier that produces a binary risk label. Across $10$ major US cities, the model outperforms state-of-the-art sequential and non-sequential baselines by 1.8% in F1-score on average when using sparse input data. Ablation studies show the added value of map imagery and demographic information, and performance remains robust under spatial sparsity. Because it relies on readily accessible public data, the approach is practical for high-resolution accident risk forecasting and can inform targeted safety interventions, proactive policies, and urban planning.
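To make the 6-hour cadence concrete, the following minimal sketch groups accident records into 6-hour windows per region and derives a binary label for each window. The region identifiers, the `label_windows` helper, and the `threshold` rule are hypothetical and used only for illustration; the summary above does not state how the paper assigns risk labels.

```python
from collections import Counter
from datetime import datetime, timezone

WINDOW_HOURS = 6  # cadence stated in the summary above


def window_index(ts: datetime) -> int:
    """Index of the 6-hour window containing ts, counted from the Unix epoch.

    Pass timezone-aware datetimes; naive ones are interpreted in local time.
    """
    return int(ts.timestamp() // (WINDOW_HOURS * 3600))


def label_windows(events, threshold=1):
    """Map (region_id, window) -> 0/1 for every window that has events.

    `events` is an iterable of (region_id, datetime) pairs. The threshold
    rule is an assumption for illustration, not the paper's definition.
    Windows without any event are absent and would be labelled 0.
    """
    counts = Counter((region, window_index(ts)) for region, ts in events)
    return {key: int(n >= threshold) for key, n in counts.items()}


events = [
    ("hex_a", datetime(2023, 1, 1, 0, 30, tzinfo=timezone.utc)),
    ("hex_a", datetime(2023, 1, 1, 5, 59, tzinfo=timezone.utc)),
    ("hex_b", datetime(2023, 1, 1, 6, 0, tzinfo=timezone.utc)),
]
labels = label_windows(events, threshold=2)
```

With `threshold=2`, the two `hex_a` events fall in the same window and yield label 1, while the single `hex_b` event in the next window yields label 0.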

Abstract

Reducing traffic accidents is a crucial global public safety concern. Accident prediction is key to improving traffic safety, enabling proactive measures to be taken before a crash occurs, and informing safety policies, regulations, and targeted interventions. Despite numerous studies on accident prediction over the past decades, many have limitations in terms of generalizability, reproducibility, or feasibility for practical use due to input data or problem formulation. To address existing shortcomings, we propose CrashFormer, a multi-modal architecture that utilizes comprehensive (but relatively easy to obtain) inputs such as the history of accidents, weather information, map images, and demographic information. The model predicts the future risk of accidents on a reasonably acceptable cadence (i.e., every six hours) for a geographical location of 5.161 square kilometers. CrashFormer is composed of five components: a sequential encoder to utilize historical accidents and weather data, an image encoder to use map imagery data, a raw data encoder to utilize demographic information, a feature fusion module for aggregating the encoded features, and a classifier that accepts the aggregated data and makes predictions accordingly. Results from extensive real-world experiments in 10 major US cities show that CrashFormer outperforms state-of-the-art sequential and non-sequential models by 1.8% in F1-score on average when using "sparse" input data.
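The last two of the five components can be illustrated with a small sketch: feature vectors that the three encoders have already produced are concatenated (the fusion described in the Figure 1 caption) and passed to a classifier. The `fuse` and `classify` helpers, the vector sizes, and the logistic classifier are assumptions chosen for brevity; they are stand-ins, not the paper's actual encoders or classifier.

```python
import math


def fuse(sequential_vec, image_vec, demographic_vec):
    """Feature fusion by concatenating the three encoded vectors."""
    return list(sequential_vec) + list(image_vec) + list(demographic_vec)


def classify(fused, weights, bias=0.0, threshold=0.5):
    """Hypothetical stand-in classifier: logistic regression on the fused vector.

    Returns (probability of high risk, binary label).
    """
    z = sum(w * x for w, x in zip(weights, fused)) + bias
    prob = 1.0 / (1.0 + math.exp(-z))
    return prob, int(prob >= threshold)


# Placeholder encoder outputs; real encoders would produce these vectors.
seq_features = [0.2, -0.1, 0.4]   # from accident and weather history
img_features = [0.7, 0.3]         # from the map image
demo_features = [0.5]             # from demographic data

fused = fuse(seq_features, img_features, demo_features)
prob, label = classify(fused, weights=[1.0] * len(fused), bias=-1.0)
```

Keeping fusion as a plain concatenation means each branch can be removed or swapped independently, which is the kind of setup an ablation over map imagery and demographics relies on.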

Paper Structure (28 sections, 3 figures, 3 tables)

Introduction
Related Work
Dataset
Accident History
Weather events
Demographics
Map Images
Research Question
Methodology
Sequential Feature Vector
Map Image Representation
Demographic Representation
CrashFormer
Historical Event Encoder
Image Encoder
...and 13 more sections

Figures (3)

Figure 1: The architecture of CrashFormer. The sequential data based on accident and weather information, along with map images and demographic data, are each fed to a separate component, and the output feature vectors are concatenated.
Figure 2: Comparing CrashFormer to baselines. $F1\_1$ denotes the F1-score for label one (high accident risk).
Figure 3: Comparing CrashFormer with the baselines to test the impact of spatial sparsity on accident prediction in Houston (TX).

Theorems & Definitions (1)

Definition 1: Geographic Region
