Enhancing Vision-Language Models with Scene Graphs for Traffic Accident Understanding
Aaron Lohner, Francesco Compagno, Jonathan Francis, Alessandro Oltramari
TL;DR
The paper tackles traffic accident understanding for autonomous driving by representing traffic scenes as scene graphs (SGs) and aligning this graph modality with vision and language encoders. It introduces Traffic-Scene-Graph Inference (TSGi), a four-stage pipeline that pre-processes videos, encodes SGs with an MRGCN-based encoder, and performs multimodal alignment with CLIP/X-CLIP before downstream classification. The study shows that SG embeddings on their own carry useful signal (∼13 percentage points above random) and can enhance vision-language classifiers, although the gains from alignment depend on data and hyperparameters. The results highlight both the potential and the limitations of treating scene graphs as a distinct modality. Overall, this work demonstrates that integrating scene-graph representations into multimodal traffic analysis is feasible and valuable, with implications for more robust accident understanding in real-world driving systems.
Abstract
Recognizing a traffic accident is an essential part of any autonomous driving or road monitoring system. An accident can take a wide variety of forms, and understanding what type of accident is taking place may help prevent similar accidents from recurring. This work focuses on classifying traffic scenes into specific accident types. We approach the problem by representing a traffic scene as a graph, where objects such as cars are represented as nodes, and the relative distances and directions between them as edges. This representation of a traffic scene is referred to as a scene graph, and can be used as input for an accident classifier. Better results are obtained with a classifier that fuses the scene graph input with visual and textual representations. This work introduces a multi-stage, multimodal pipeline that pre-processes videos of traffic accidents, encodes them as scene graphs, and aligns this representation with vision and language modalities before executing the classification task. When trained on four classes, our method achieves a balanced accuracy score of 57.77% on an (unbalanced) subset of the popular Detection of Traffic Anomaly (DoTA) benchmark, an increase of close to 5 percentage points over the case where scene graph information is not taken into account.
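To make the scene graph representation concrete, the following is a minimal Python sketch of how objects could become nodes and how relative distances and directions could become labeled edges. It is an illustration only and not the pipeline used in this work: the SceneObject class, the ego-centred coordinates, the distance thresholds, and the relation names (very_near, in_front_of, and so on) are all assumptions chosen for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical detected object in ego-centred coordinates (metres)."""
    name: str
    x: float  # lateral offset, positive = to the right
    y: float  # longitudinal offset, positive = ahead


# Assumed distance bins (upper limit in metres, relation label).
DISTANCE_BINS = [(4.0, "very_near"), (10.0, "near"), (25.0, "visible")]


def distance_relation(a, b):
    """Return the label of the first bin containing the a-b distance, or None."""
    d = math.hypot(a.x - b.x, a.y - b.y)
    for limit, label in DISTANCE_BINS:
        if d <= limit:
            return label
    return None  # too far apart: no edge


def direction_relation(subject, reference):
    """Coarse position of `subject` relative to `reference`."""
    dx = subject.x - reference.x
    dy = subject.y - reference.y
    if abs(dy) >= abs(dx):
        return "in_front_of" if dy > 0 else "behind"
    return "right_of" if dx > 0 else "left_of"


def build_scene_graph(objects):
    """Return (nodes, edges); each edge is a (subject, relation, reference) triple."""
    nodes = [obj.name for obj in objects]
    edges = []
    for subject in objects:
        for reference in objects:
            if subject is reference:
                continue
            dist = distance_relation(subject, reference)
            if dist is None:
                continue  # only relate objects that are close enough
            edges.append((subject.name, dist, reference.name))
            edges.append(
                (subject.name, direction_relation(subject, reference), reference.name)
            )
    return nodes, edges


if __name__ == "__main__":
    frame = [
        SceneObject("ego", 0.0, 0.0),
        SceneObject("car_1", 0.0, 8.0),
        SceneObject("pedestrian", 3.0, 1.0),
    ]
    nodes, edges = build_scene_graph(frame)
    print(nodes)
    for edge in edges:
        print(edge)
```

Because each edge carries a relation label, the result is a multi-relational graph, which is the kind of input a relational graph encoder expects. In a video setting, one such graph would be built per frame.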
