Zero-shot Hazard Identification in Autonomous Driving: A Case Study on the COOOL Benchmark

Lukas Picek, Vojtěch Čermák, Marek Hanzl

TL;DR

This paper addresses zero-shot hazard identification in autonomous driving through a submission to the COOOL benchmark. It presents a three-task pipeline: kernel-based change point detection on bounding boxes and optical flow for driver reaction detection, a proximity- and ViT-based method for hazard identification, and MOLMO-based hazard captioning. The approach reduces relative error by 33% compared with the baselines and places 2nd among 32 teams, demonstrating the value of hybrid CV and vision-language strategies for open-set hazards. The work also discusses limitations due to data quality and domain shift and outlines directions for robustness and real-time deployment.

Abstract

This paper presents our submission to the COOOL competition, a novel benchmark for detecting and classifying out-of-label hazards in autonomous driving. Our approach integrates diverse methods across three core tasks: (i) driver reaction detection, (ii) hazard object identification, and (iii) hazard captioning. For driver reaction detection, we propose kernel-based change point detection on bounding box and optical flow dynamics to analyze motion patterns. For hazard identification, we combined a naive proximity-based strategy with object classification using a pre-trained ViT model. Finally, for hazard captioning, we used the MOLMO vision-language model with tailored prompts to generate precise and context-aware descriptions of rare and low-resolution hazards. The proposed pipeline outperformed the baseline methods by a large margin, reducing the relative error by 33%, and ranked 2nd on the final leaderboard of 32 teams.
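
To make the idea of kernel-based change point detection more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a single change point, an RBF kernel with a median-heuristic bandwidth, and a one-dimensional input such as a per-frame bounding box area or mean optical flow magnitude. The function names (`rbf_gram`, `segment_cost`, `detect_single_change`) and the `min_size` parameter are hypothetical; the abstract does not specify the kernel, the cost, or the number of change points.

```python
import numpy as np


def rbf_gram(signal, gamma=None):
    """Gram matrix of an RBF kernel over the samples of a signal."""
    x = np.asarray(signal, dtype=float).reshape(len(signal), -1)
    sq_dist = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    if gamma is None:
        # Median heuristic for the bandwidth (an assumption of this sketch).
        positive = sq_dist[sq_dist > 0]
        gamma = 1.0 / np.median(positive) if positive.size else 1.0
    return np.exp(-gamma * sq_dist)


def segment_cost(gram, start, end):
    """Kernel cost of samples [start, end): within-segment scatter in
    feature space, sum_i k(x_i, x_i) - (1/n) * sum_{i,j} k(x_i, x_j)."""
    block = gram[start:end, start:end]
    return np.trace(block) - block.sum() / (end - start)


def detect_single_change(signal, min_size=2):
    """Return the split index that minimizes the summed cost of both segments."""
    gram = rbf_gram(signal)
    n = gram.shape[0]
    candidates = range(min_size, n - min_size + 1)
    costs = [segment_cost(gram, 0, t) + segment_cost(gram, t, n)
             for t in candidates]
    return candidates[int(np.argmin(costs))]


if __name__ == "__main__":
    # Synthetic stand-in for a bounding box area that jumps at frame 30,
    # as might happen when a hazard suddenly approaches the camera.
    box_area = np.concatenate([np.full(30, 1.0), np.full(20, 5.0)])
    print(detect_single_change(box_area))
```

In this sketch, the returned index marks the first frame of the second segment, which could serve as a candidate driver reaction time. Real signals are noisy, so the estimate would generally be approximate rather than exact.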

Paper Structure (21 sections, 6 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: The COOOL benchmark focuses on zero-shot driver reaction detection, hazardous object recognition, and hazard captioning. A simplified result of our approach is shown on a selected frame from one of the test videos. Colors depict the hazard state of each object; the classification is shown above the object.
  • Figure 2: Dashcam views. The COOOL dataset videos originate from different parts of the world and vary in quality.
  • Figure 3: Cropped-out hazard objects. The COOOL dataset includes a wide variety of animals (top) and other objects (bottom).
  • Figure 4: Driver reaction recognition. The optical flow graph shows pixel-level motion intensity over time, with spikes during significant vehicle or background movements. The object size dynamics graph visualizes the change in bounding box sizes of hazards. Combined (the mean ensemble in the image), these methods quantify motion and reason about situational risks to identify potential hazards.
  • Figure 5: Hazard captioning with MOLMO. Two prompts (1st and 3rd rows) and the resulting outputs. The "correct" answer is highlighted in green.