ScVLM: Enhancing Vision-Language Model for Safety-Critical Event Understanding

Liang Shi, Boyu Jiang, Tong Zeng, Feng Guo

TL;DR

ScVLM targets rare safety-critical events (SCEs) in driving by combining supervised event-type classification, contrastive conflict-type learning, and LLM-based narrative generation. The method pairs a video encoder and classifiers with an LLM to produce contextually accurate SCE narratives while mitigating VLM hallucinations, and it is demonstrated on more than 8,600 SCEs from the SHRP 2 NDS dataset. The authors report higher narrative quality and more reliable event descriptions than baselines, across both full and SCE-focused evaluations. The work offers a practical framework for safer AI systems in automated driving, and the authors state that the code will be released to support reproducibility.

Abstract

Accurately identifying, understanding, and describing traffic safety-critical events (SCEs), including crashes, tire strikes, and near-crashes, is crucial for advanced driver assistance systems, automated driving systems, and traffic safety. Because SCEs are rare, most general vision-language models (VLMs) have not been trained sufficiently to link SCE videos with narratives, which can lead to hallucinations and the omission of key safety characteristics. Here, we introduce ScVLM, a novel hybrid methodology that integrates supervised and contrastive learning techniques to classify the severity and types of SCEs and to generate narrative descriptions of them. The approach uses classification to enhance VLMs' comprehension of driving videos and to improve the rationality of event descriptions. It is trained and evaluated on more than 8,600 SCEs from the Second Strategic Highway Research Program Naturalistic Driving Study dataset, the largest publicly accessible driving dataset with videos and SCE annotations. The results demonstrate the superiority of the proposed approach in generating contextually accurate event descriptions and mitigating VLM hallucinations. The code will be available at https://github.com/datadrivenwheels/ScVLM.
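
One way to picture how classification results might steer narrative generation is a prompt that states the classifier outputs as fixed facts before asking an LLM to write. The sketch below is only an illustration under that assumption: the `SceContext` container, the `build_narrative_prompt` helper, and the prompt wording are hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass
class SceContext:
    """Hypothetical container for upstream outputs (names are assumptions)."""
    event_type: str         # e.g. "crash" or "near-crash", from a supervised classifier
    conflict_type: str      # e.g. "conflict with lead vehicle", from a contrastive stage
    scene_description: str  # free-text scene and environment details from a general VLM


def build_narrative_prompt(ctx: SceContext) -> str:
    """Assemble an LLM prompt that treats classifier outputs as given facts.

    Stating the event and conflict type up front is one plausible way to
    discourage the LLM from inventing a different kind of event.
    """
    return (
        "You are describing a driving event recorded by an in-vehicle camera.\n"
        f"Event type (from classifier): {ctx.event_type}\n"
        f"Conflict type (from classifier): {ctx.conflict_type}\n"
        f"Scene details (from VLM): {ctx.scene_description}\n"
        "Write a short narrative consistent with the event and conflict type above. "
        "Do not add details that are not listed."
    )


example = SceContext(
    event_type="near-crash",
    conflict_type="conflict with lead vehicle",
    scene_description="daytime, dry road, moderate traffic",
)
prompt = build_narrative_prompt(example)
```

The resulting `prompt` string would then be passed to whichever LLM is in use; that call is left out here because the specific model and API are not given in this summary.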

Paper Structure (11 sections, 8 equations, 9 figures, 5 tables)

Figures (9)

  • Figure 1: Example scene-understanding result from VideoLLaMA2 (incorrect answers are highlighted in red).
  • Figure 2: The proposed multi-stage approach for generating narrative descriptions of SCEs from driving videos. The process integrates supervised learning for event classification (e.g., crash, near-crash) and contrastive learning for conflict type identification (e.g., conflict with lead vehicle, single vehicle conflict). The VLM extracts visual and environmental information, which is further refined by an LLM to produce a detailed narrative of the SCE.
  • Figure 3: Supervised learning structure for video data.
  • Figure 4: Contrastive learning structure for video-text pair data.
  • Figure 5: Inference procedure of contrastive learning approach.
  • ...and 4 more figures
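
The captions for Figures 4 and 5 refer to contrastive learning on video-text pairs and a matching inference procedure. A common way to run inference with a contrastively trained model is to embed the video, embed one text prompt per candidate label, and pick the label with the highest cosine similarity. The sketch below assumes that scheme; the function names and the toy three-dimensional embeddings are made up for illustration, and the paper's actual procedure may differ. The two label strings are the examples given in the Figure 2 caption.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def predict_conflict_type(video_embedding, text_embeddings):
    """Return the best-matching label and the similarity score for every label.

    text_embeddings maps each candidate conflict-type label to the embedding
    of its text prompt (in practice produced by a text encoder).
    """
    scores = {
        label: cosine_similarity(video_embedding, emb)
        for label, emb in text_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy embeddings standing in for real encoder outputs (assumed values).
text_embeddings = {
    "conflict with lead vehicle": [0.9, 0.1, 0.0],
    "single vehicle conflict": [0.0, 0.2, 0.9],
}
video_embedding = [0.8, 0.3, 0.1]

label, scores = predict_conflict_type(video_embedding, text_embeddings)
```

With these toy vectors the video embedding points mostly along the first axis, so the "conflict with lead vehicle" prompt scores highest. In a real system the embeddings would come from the trained video and text encoders rather than hand-written lists.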