Enhancing Vision-Language Models with Scene Graphs for Traffic Accident Understanding

Aaron Lohner; Francesco Compagno; Jonathan Francis; Alessandro Oltramari

TL;DR

The paper tackles traffic accident understanding for autonomous driving by representing traffic scenes as scene graphs (SGs) and aligning this graph modality with vision and language encoders. It introduces Traffic-Scene-Graph Inference (TSGi), a four-stage pipeline that pre-processes videos, encodes the resulting SGs with an MRGCN-based encoder, and performs multimodal alignment with CLIP/X-CLIP before downstream classification. The study shows that SG embeddings carry useful signal (∼13 percentage points above random) and can enhance vision-language classifiers, though the gains from alignment depend on the data and hyperparameters. Overall, the results indicate both the potential and the limitations of treating scene graphs as a distinct modality in multimodal traffic analysis, with implications for more robust accident understanding in real-world driving systems.
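As a rough illustration of what aligning a graph encoder with a vision or language encoder can look like, the sketch below computes a CLIP-style symmetric contrastive loss between two batches of embeddings. It is an assumption-laden sketch, not the authors' training code: the function names, the toy embeddings, and the temperature of 0.07 are all hypothetical, and this summary does not specify the exact loss used by TSGi.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(z - m) for z in row))
        loss += log_sum - row[i]
    return loss / len(logits)

def symmetric_contrastive_loss(graph_embs, other_embs, temperature=0.07):
    """Hypothetical helper: pair i of graph_embs is assumed to match pair i of other_embs."""
    g = [l2_normalize(v) for v in graph_embs]
    o = [l2_normalize(v) for v in other_embs]
    # Similarity matrix: rows index graph embeddings, columns index the other modality.
    logits = [[sum(a * b for a, b in zip(gi, oj)) / temperature for oj in o] for gi in g]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the graph-to-other and other-to-graph directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy embeddings (hypothetical): two scene graphs and two video clips.
graph_embs = [[1.0, 0.0], [0.0, 1.0]]
video_matched = [[0.9, 0.1], [0.1, 0.9]]   # clip i resembles graph i
video_swapped = [[0.1, 0.9], [0.9, 0.1]]   # clip i resembles the other graph

loss_matched = symmetric_contrastive_loss(graph_embs, video_matched)
loss_swapped = symmetric_contrastive_loss(graph_embs, video_swapped)
print(loss_matched < loss_swapped)
```

The loss is low when each scene-graph embedding is closest to its own clip and high when the pairing is scrambled, which is the signal an alignment stage of this kind would optimize.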

Abstract

Recognizing a traffic accident is an essential part of any autonomous driving or road monitoring system. An accident can appear in a wide variety of forms, and understanding what type of accident is taking place may be useful to prevent it from recurring. This work focuses on classifying traffic scenes into specific accident types. We approach the problem by representing a traffic scene as a graph, where objects such as cars can be represented as nodes, and relative distances and directions between them as edges. This representation of a traffic scene is referred to as a scene graph, and can be used as input for an accident classifier. Better results are obtained with a classifier that fuses the scene graph input with visual and textual representations. This work introduces a multi-stage, multimodal pipeline that pre-processes videos of traffic accidents, encodes them as scene graphs, and aligns this representation with vision and language modalities before executing the classification task. When trained on 4 classes, our method achieves a balanced accuracy score of 57.77% on an (unbalanced) subset of the popular Detection of Traffic Anomaly (DoTA) benchmark, representing an increase of close to 5 percentage points from the case where scene graph information is not taken into account.
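Because the evaluation subset is unbalanced, the reported metric is balanced accuracy rather than plain accuracy. The sketch below uses the standard definition (the mean of per-class recall) on made-up labels; the label values and the function name are hypothetical and are not taken from DoTA.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Hypothetical labels for 4 accident classes (0..3), heavily skewed to class 0.
y_true = [0] * 7 + [1, 2, 3]
y_majority = [0] * 10  # a classifier that always predicts the most frequent class

plain_accuracy = sum(t == p for t, p in zip(y_true, y_majority)) / len(y_true)
print(plain_accuracy)                          # 0.7
print(balanced_accuracy(y_true, y_majority))   # 0.25
```

A majority-class predictor looks strong under plain accuracy but scores only 1/4 under balanced accuracy with four classes, which is why the latter is the more informative number on an unbalanced subset.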

Paper Structure (23 sections, 3 figures, 5 tables)

This paper contains 23 sections, 3 figures, 5 tables.

Introduction
Related Work
Scene Graphs for Representing Traffic Scenes
Leveraging Scene Graphs with Neural Networks
Multimodal Alignment with Contrastive Representations
Methodology
Data Pre-processing
Scene Graph Encoder
Multimodal Alignment
Fine Tuning for Downstream Task
Experimental Design
Dataset
Baselines and Metrics
Modality Alignment, Pre-training, and Hyperparameters
Implementation Details
...and 8 more sections

Figures (3)

Figure 1: An overview of the TSGi Architecture. Beginning with video and text inputs, video frames are sampled and used to generate scene graphs. Then, alignment training is performed on the three encoders before a prediction head is used to classify the accident.
Figure 2: A scene graph generated using the rs2v tool. Starting from a video frame (Raw Image), the scene graph generator (SGG) first detects objects in the scene (Object Detection Image) and generates the bird's-eye view (BEV) (Bird's Eye Image) before creating the scene graph representation (SceneGraph Image). This scene graph shows the ego car relative to two other vehicles, one categorized as being in the left lane and the other in the right lane. The closer vehicle is recognized as being near collision (with the edge attribute "near_coll"), whereas the farther vehicle is registered in the scene graph as simply being "visible".
Figure 3: Starting from a video frame, the SGG first detects objects in the scene and generates the BEV image before creating the scene graph representation. This scene graph shows the ego car relative to a person on a scooter, categorized as in the left lane. The person is recognized as being near collision, with the edge attribute "near_coll". Note that there is a vehicle detected in the distance (right part of detection image, highlighted in blue) but it is not included in the graph since it is too far.
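The behaviour described in the captions of Figures 2 and 3 can be sketched as a small distance-based rule: objects are placed relative to the ego car in the bird's-eye view, nearby ones get a "near_coll" edge, farther ones get a "visible" edge, and objects beyond a cutoff are dropped. This is a minimal sketch under stated assumptions: the thresholds, the coordinate convention, and the `build_scene_graph` helper are hypothetical and do not reproduce the rs2v tool's actual API or settings.

```python
import math
from dataclasses import dataclass, field

# Hypothetical thresholds in metres; the source does not state the values used.
NEAR_COLL_MAX = 4.0
VISIBLE_MAX = 50.0

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)  # name -> attribute dict
    edges: list = field(default_factory=list)  # (source, relation, target)

def build_scene_graph(objects):
    """objects maps a name to an assumed (x, y) bird's-eye-view position relative
    to the ego car: x is the lateral offset (negative = left), y the distance ahead."""
    graph = SceneGraph(nodes={"ego": {}})
    for name, (x, y) in objects.items():
        dist = math.hypot(x, y)
        if dist > VISIBLE_MAX:
            continue  # too far away: left out of the graph entirely
        lane = "left" if x < 0 else "right"
        graph.nodes[name] = {"lane": lane}
        relation = "near_coll" if dist <= NEAR_COLL_MAX else "visible"
        graph.edges.append(("ego", relation, name))
    return graph

# Hypothetical detections: one close car on the left, one farther car on the
# right, and one distant car that should be excluded.
detections = {"car_1": (-2.0, 3.0), "car_2": (3.0, 20.0), "car_3": (1.0, 80.0)}
graph = build_scene_graph(detections)
print(graph.edges)
# [('ego', 'near_coll', 'car_1'), ('ego', 'visible', 'car_2')]
```

The distant `car_3` is detected but never becomes a node, mirroring the vehicle in Figure 3 that is omitted from the graph because it is too far away.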
