Towards a Multi-Agent Vision-Language System for Zero-Shot Novel Hazardous Object Detection for Autonomous Driving Safety

Shashank Shriram, Srinivasa Perisetla, Aryan Keskar, Harsha Krishnaswamy, Tonko Emil Westerhof Bossen, Andreas Møgelmose, Ross Greer

TL;DR

This work tackles open-set hazard detection for autonomous driving by coupling vision-language models (VLMs) with large language models (LLMs) in two parallel tracks to detect, describe, and localize hazards beyond predefined categories. It introduces COOOLER, an enhanced COOOL benchmark with denoised video data and open-set hazard descriptions, evaluated via cosine similarity and new metrics BESM and SAM, yielding moderate but meaningful performance improvements. The approach demonstrates context-aware hazard reasoning and cross-modal verification, highlighting the potential of zero-shot reasoning for safer autonomous navigation while acknowledging limitations in small/occluded hazard detection and cross-track merging. Overall, the study provides a structured framework for open-world hazard understanding and points to practical pathways for real-time hazard assessment in autonomous driving systems.

Abstract

Detecting anomalous hazards in visual data, particularly in video streams, is a critical challenge in autonomous driving. Existing models often struggle with unpredictable, out-of-label hazards due to their reliance on predefined object categories. In this paper, we propose a multimodal approach that integrates vision-language reasoning with zero-shot object detection to improve hazard identification and explanation. Our pipeline combines a Vision-Language Model (VLM) and a Large Language Model (LLM) to detect hazardous objects within a traffic scene. We refine object detection by incorporating OpenAI's CLIP model to match predicted hazards with bounding box annotations, improving localization accuracy. To assess model performance, we create a ground truth dataset by denoising and extending the foundational COOOL (Challenge-of-Out-of-Label) anomaly detection benchmark dataset with complete natural language descriptions for hazard annotations. We define an evaluation of hazard detection and labeling on the extended dataset using cosine similarity, which measures the semantic similarity between the predicted hazard description and the annotated ground truth for each video. Additionally, we release a set of tools for structuring and managing large-scale hazard detection datasets. Our findings highlight the strengths and limitations of current vision-language-based approaches, offering insights into future improvements in autonomous hazard detection systems. Our models, scripts, and data can be found at https://github.com/mi3labucm/COOOLER.git
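
As an illustration of the cosine-similarity evaluation mentioned in the abstract, the following minimal sketch scores a predicted hazard description against a ground-truth description. The abstract does not say which text embedding is used, so the `bag_of_words` function here is a toy stand-in chosen only to keep the example self-contained; the function names and the sample sentences are likewise hypothetical and not taken from the paper or its repository.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Toy stand-in embedding (an assumption, not the paper's method):
    a sparse vector of lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[key] * b.get(key, 0) for key in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty description has no direction; treat it as no match.
        return 0.0
    return dot / (norm_a * norm_b)


def score_description(predicted, ground_truth):
    """Score one predicted hazard description against its annotation."""
    return cosine_similarity(bag_of_words(predicted), bag_of_words(ground_truth))


if __name__ == "__main__":
    # Hypothetical descriptions, for illustration only.
    print(score_description("deer crossing the road", "a deer on the road"))
```

A learned sentence embedding would capture paraphrases that word counts miss; only the embedding function would change, while the cosine computation stays the same.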

Paper Structure

This paper contains 23 sections, 5 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Our pipeline consists of two parallel tracks, both leveraging vision-language models (VLMs) and large language models (LLMs) to analyze driving scenes. One track focuses on generating detailed linguistic scene descriptions, while the other extracts key object nouns for structured representation. Through iterative refinement, these tracks produce a comprehensive list of elements most likely to be present in the scene. A zero-shot hazard verification agent then evaluates these elements to detect hazardous and anomalous objects with high precision.
  • Figure 2: Visualization of our object detection process, in which object snippets are compared to label embeddings using OpenAI's CLIP model. The heatmap displays similarity scores, with each row representing an object snippet and each column corresponding to a potential label. The top 10th percentile of similarity scores (highlighted in green) is selected to identify the most relevant label associations. In the final stage, flagged elements that share common labels are grouped together, enabling more accurate object detection and classification.
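
The selection and grouping step described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes "top 10th percentile" means the highest 10% of scores taken over the whole snippet-by-label matrix (rather than per row), it takes precomputed similarity scores as input instead of calling CLIP, and the function names, labels, and numbers are hypothetical.

```python
from collections import defaultdict


def top_fraction_cutoff(values, fraction_kept=0.10):
    """Return the score at or above which roughly the top
    `fraction_kept` of the values lie (at least one value is kept)."""
    ordered = sorted(values, reverse=True)
    keep = max(1, int(len(ordered) * fraction_kept))
    return ordered[keep - 1]


def group_snippets_by_label(similarity, labels, fraction_kept=0.10):
    """Flag (snippet, label) pairs whose score reaches the cutoff,
    then group the flagged snippet indices by their shared label.

    similarity: list of rows, one row per object snippet,
                one column per candidate label.
    """
    scores = [s for row in similarity for s in row]
    if not scores:
        return {}
    cutoff = top_fraction_cutoff(scores, fraction_kept)
    groups = defaultdict(list)
    for snippet_index, row in enumerate(similarity):
        for label_index, score in enumerate(row):
            if score >= cutoff:  # ties with the cutoff are also kept
                groups[labels[label_index]].append(snippet_index)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical labels and similarity scores, for illustration only.
    labels = ["deer", "cone", "tire", "dog", "box"]
    similarity = [
        [0.31, 0.12, 0.10, 0.18, 0.11],
        [0.29, 0.14, 0.09, 0.20, 0.10],
        [0.08, 0.13, 0.15, 0.07, 0.16],
        [0.11, 0.17, 0.19, 0.05, 0.06],
    ]
    print(group_snippets_by_label(similarity, labels))
```

With 20 scores and a 10% fraction, only the two highest scores pass the cutoff, so snippets 0 and 1 are grouped under the shared label "deer". Computing the cutoff per row instead would guarantee every snippet at least one label, which is a different design choice than the one assumed here.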