INSIGHT: Enhancing Autonomous Driving Safety through Vision-Language Models on Context-Aware Hazard Detection and Edge Case Evaluation

Dianwei Chen, Zifan Zhang, Yuchen Liu, Xianfeng Terry Yang

TL;DR

The paper addresses the difficulty of generalizing autonomous driving systems to unpredictable edge cases by introducing INSIGHT, a hierarchical vision-language framework that fuses semantic and visual inputs for generalized hazard tracking. The framework jointly learns visual-text representations, localizes hazards using attention maps, and generates descriptive hazard cues. It is trained with a multi-task loss and supported by a convergence analysis under AdamW. The method uses a Qwen2-VL-7B backbone with LoRA-based fine-tuning on a subset of BDD100K, and reports substantial gains in hazard localization accuracy and text-generation quality over baselines, along with robust edge-case generalization. The results have practical implications for safer autonomous driving and offer a scalable framework for multimodal scene understanding, localization, and explanation in real-time settings.

Abstract

Autonomous driving systems face significant challenges in handling unpredictable edge-case scenarios, such as adversarial pedestrian movements, dangerous vehicle maneuvers, and sudden environmental changes. Current end-to-end driving models struggle to generalize to these rare events because of limitations in traditional detection and prediction approaches. To address this, we propose INSIGHT (Integration of Semantic and Visual Inputs for Generalized Hazard Tracking), a hierarchical vision-language model (VLM) framework designed to enhance hazard detection and edge-case evaluation. Using multimodal data fusion, our approach integrates semantic and visual representations, enabling precise interpretation of driving scenarios and accurate forecasting of potential dangers. Through supervised fine-tuning of VLMs, we optimize spatial hazard localization using attention-based mechanisms and coordinate regression techniques. Experimental results on the BDD100K dataset demonstrate a substantial improvement in hazard prediction straightforwardness and accuracy over existing models, together with a notable increase in generalization performance. This advancement enhances the robustness and safety of autonomous driving systems, improving situational awareness and supporting decision-making in complex real-world scenarios.
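
As a rough illustration of how coordinate regression for hazard localization can be combined with a text-generation objective in one training signal, the sketch below adds a mean-squared error on a normalized (x, y) hazard point to an average token-level negative log-likelihood. The function names, the choice of MSE, and the weights `w_loc` and `w_text` are assumptions made for illustration only; they are not taken from the paper.

```python
import math

def coordinate_mse(pred, target):
    """Mean squared error between predicted and ground-truth coordinates."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def token_cross_entropy(token_probs):
    """Average negative log-likelihood of the reference tokens.

    token_probs[i] is the probability the model assigned to the
    i-th token of the reference hazard description.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def multi_task_loss(pred_xy, true_xy, token_probs, w_loc=1.0, w_text=1.0):
    """Weighted sum of a localization term and a text term.

    w_loc and w_text are hypothetical weights, not values from the paper.
    """
    loc = coordinate_mse(pred_xy, true_xy)
    text = token_cross_entropy(token_probs)
    return w_loc * loc + w_text * text

if __name__ == "__main__":
    # Hazard point in normalized image coordinates, plus per-token probabilities.
    loss = multi_task_loss((0.42, 0.61), (0.40, 0.65), [0.9, 0.8, 0.95])
    print(f"combined loss: {loss:.4f}")
```

In this form, raising `w_loc` relative to `w_text` would push training toward tighter localization at the expense of description quality, which is the usual trade-off a multi-task weighting has to balance.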

Paper Structure

This paper contains 29 sections, 12 equations, 5 figures, 2 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Integration of Semantic and Visual Inputs for Generalized Hazard Tracking through Supervised Fine-tuning VLMs
  • Figure 2: Manual Annotation for dataset preprocessing
  • Figure 3: Supervised fine-tuning metrics on Qwen2-VL-7B
  • Figure 4: Comparison of ground truth and predicted coordinates
  • Figure 5: Demonstration of generalization ability