Learning Multimodal Embeddings for Traffic Accident Prediction and Causal Estimation

Ziniu Zhang, Minxuan Duan, Haris N. Koutsopoulos, Hongyang R. Zhang

TL;DR

This work tackles proactive road-safety analysis by integrating road-network structure with high-resolution satellite imagery and weather/traffic context. It builds a multimodal dataset spanning six U.S. states, containing nine million accident records and one million satellite images, each aligned to a road-network node, and introduces a multimodal fusion framework that combines graph embeddings with vision-based embeddings through Basic, Gated, and mixture-of-experts (MoE) fusion. The approach achieves an average AUROC of 90.1%, a 3.7% gain over graph-only models, and enables causal estimation of factors such as precipitation, seasonality, and road type by matching on the learned embeddings to estimate the average treatment effect on the treated (ATT). The study also demonstrates cross-state transfer, uses ablations to measure each modality's contribution, and releases the MMTraCE dataset to support broader research in multimodal transportation analytics. Overall, the work provides a scalable benchmark and a practical methodology for combining visual and structural road data to predict accidents and inform safety interventions.
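
As a rough illustration of what gated fusion of two embeddings can look like, the sketch below mixes a graph embedding and an image embedding with an element-wise sigmoid gate. This is a minimal sketch under stated assumptions, not the paper's architecture: the function name `gated_fusion`, the parameters `weights` and `bias`, and the requirement that both embeddings share one dimension are all hypothetical choices made for the example.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(graph_emb, image_emb, weights, bias):
    """Fuse two same-length embeddings with an element-wise gate.

    Hypothetical parametrization (an assumption, not taken from the paper):
      z     = sigmoid(W @ [graph_emb; image_emb] + b)
      fused = z * graph_emb + (1 - z) * image_emb
    weights: d rows, each of length 2d; bias: length d.
    """
    if len(graph_emb) != len(image_emb):
        raise ValueError("embeddings must have the same dimension")
    concat = list(graph_emb) + list(image_emb)  # length 2d
    fused = []
    for i, (g, v) in enumerate(zip(graph_emb, image_emb)):
        gate = sigmoid(sum(w * x for w, x in zip(weights[i], concat)) + bias[i])
        fused.append(gate * g + (1.0 - gate) * v)
    return fused


# Toy usage: with all-zero parameters every gate equals 0.5,
# so the fused vector is the plain average of the two inputs.
graph_emb = [1.0, 2.0]
image_emb = [3.0, 4.0]
zero_weights = [[0.0] * 4 for _ in range(2)]
fused = gated_fusion(graph_emb, image_emb, zero_weights, [0.0, 0.0])
```

In a trained model the gate parameters would be learned jointly with the encoders, letting each node lean on whichever modality is more informative for it.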

Abstract

We consider analyzing traffic accident patterns using both road network data and satellite images aligned to road graph nodes. Previous work for predicting accident occurrences relies primarily on road network structural features while overlooking physical and environmental information from the road surface and its surroundings. In this work, we construct a large multimodal dataset across six U.S. states, containing nine million traffic accident records from official sources, and one million high-resolution satellite images, one for each node of the road network. Additionally, every node is annotated with features such as the region's weather statistics and road type (e.g., residential vs. motorway), and each edge is annotated with traffic volume information (i.e., Average Annual Daily Traffic). Utilizing this dataset, we conduct a comprehensive evaluation of multimodal learning methods that integrate both visual and network embeddings. Our findings show that integrating both data modalities improves prediction accuracy, achieving an average AUROC of $90.1\%$, which is a $3.7\%$ gain over graph neural network models that only utilize graph structures. With the improved embeddings, we conduct a causal analysis based on a matching estimator to estimate the key contributing factors influencing traffic accidents. We find that accident rates rise by $24\%$ under higher precipitation, by $22\%$ on higher-speed roads such as motorways, and by $29\%$ due to seasonal patterns, after adjusting for other confounding factors. Ablation studies confirm that satellite imagery features are essential for achieving accurate prediction.
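
To make the idea of a matching estimator concrete, the sketch below pairs each treated unit with its nearest control unit in embedding space and averages the outcome differences, which gives an estimate of the ATT. It is a minimal sketch, assuming one-to-one nearest-neighbor matching with replacement under Euclidean distance; the function name `att_nearest_neighbor` and the toy data are hypothetical, and the paper's actual estimator may differ.

```python
import math


def att_nearest_neighbor(embeddings, treated, outcomes):
    """Estimate the ATT by nearest-neighbor matching on embeddings.

    embeddings: list of equal-length float vectors, one per unit
    treated:    list of booleans (True = unit received the treatment)
    outcomes:   list of observed outcomes, one per unit
    Each treated unit is matched (with replacement) to the closest
    control unit; the ATT estimate is the mean outcome difference.
    """
    treated_idx = [i for i, t in enumerate(treated) if t]
    control_idx = [i for i, t in enumerate(treated) if not t]
    if not treated_idx or not control_idx:
        raise ValueError("need at least one treated and one control unit")
    diffs = []
    for i in treated_idx:
        match = min(control_idx,
                    key=lambda c: math.dist(embeddings[i], embeddings[c]))
        diffs.append(outcomes[i] - outcomes[match])
    return sum(diffs) / len(diffs)


# Toy data (made up for illustration): two tight clusters, each holding
# one treated and one control unit.
embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
treated = [True, False, True, False]
outcomes = [3.0, 1.0, 6.0, 5.0]
att = att_nearest_neighbor(embeddings, treated, outcomes)
```

Matching on embeddings rather than raw covariates relies on the embeddings capturing the confounders; units that look alike in embedding space are treated as comparable apart from the treatment itself.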

Paper Structure

This paper contains 36 sections, 19 equations, 9 figures, 11 tables, 1 algorithm.

Figures (9)

  • Figure 1: Example satellite images showing different types of roads. Each image is centered around a road network node and captures both the physical characteristics of the road, such as layout, width, and intersections, and the surrounding context, including vegetation, buildings, and terrain.
  • Figure 2: The proportion of different road types among six states' road networks. Residential roads account for the vast majority of the total, making up approximately $74.5\%$ of all roads. Other types, such as tertiary, secondary, and primary, contribute much smaller proportions by comparison.
  • Figure 3: Seasonal comparison of traffic accidents in Massachusetts. Spring and winter panels: Accident records in Massachusetts during spring and winter. Accident points are more densely distributed in winter, indicating a higher frequency of incidents, likely due to adverse weather conditions. Total-count and average-count panels: Accident count of motorway (M), motorway link (M_L), primary (Pri), primary link (Pri_L), residential (Res), secondary (Sec), secondary link (Sec_L), tertiary (Ter), tertiary link (Ter_L), trunk (Tru), trunk link (Tru_L), living street, road, and trailhead. The total-count panel gives the top-$10$ total count on different road types, while the average-count panel provides the top-$10$ average count on different road types.
  • Figure 4: Cross-state AUROC performance of the GIN + MoE model, computed over six states. Each entry shows the score when training on one state (represented by rows) and testing on another state (represented by columns). Darker colors indicate better transferability.
  • Figure 5: Accident records with different ranges of precipitation and traffic volume.
  • ...and 4 more figures

Theorems & Definitions (1)

  • Remark 2.1