Exploring the Potential of Multi-Modal AI for Driving Hazard Prediction

Korawat Charoenpitaks, Van-Quang Nguyen, Masanori Suganuma, Masahiro Takahashi, Ryoma Niihara, Takayuki Okatani

TL;DR

The paper tackles driving hazard prediction from a single dashcam image by framing it as visual abductive reasoning and introducing the DHPR dataset with 14,975 annotated scenes (speed, hazard explanations, and entities). It proposes a CLIP-based baseline extended with dual encoders and multimodal adapters to perform retrieval and generation tasks, and compares against state-of-the-art visual-language models and GPT-4V. Results show the proposed architecture improves cross-modal grounding and hazard explanation quality, while GPT-4V offers valuable zero-shot semantic assessments though lagging on traditional metrics. The work highlights the feasibility of multi-modal abductive reasoning for driving hazards, while noting limitations of static imagery and outlining directions for incorporating video data and richer vehicle signals in future work.

Abstract

This paper addresses the problem of predicting hazards that drivers may encounter while driving a car. We formulate it as a task of anticipating impending accidents using a single input image captured by a car dashcam. Unlike existing approaches to driving hazard prediction that rely on computational simulations or anomaly detection from videos, this study focuses on high-level inference from static images. The problem requires predicting and reasoning about future events based on uncertain observations, which falls under visual abductive reasoning. To enable research in this understudied area, we create a new dataset named DHPR (Driving Hazard Prediction and Reasoning). The dataset consists of 15K dashcam images of street scenes, and each image is associated with a tuple containing car speed, a hypothesized hazard description, and visual entities present in the scene. These are provided by human annotators, who identify risky scenes and describe potential accidents that could occur a few seconds later. We present several baseline methods and evaluate their performance on our dataset, identifying remaining issues and discussing future directions. This study contributes to the field by introducing a novel problem formulation and dataset, enabling researchers to explore the potential of multi-modal AI for driving hazard prediction.
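
To make the annotation tuple described in the abstract concrete, the following is a minimal sketch of how one record might be represented in Python. It is not the dataset's actual schema: the class names, field names, the km/h unit, the bounding-box layout, and all example values are assumptions chosen for illustration, loosely modeled on the scenario in Figure 1.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class VisualEntity:
    """One annotated entity in the scene (hypothetical layout)."""
    label: str
    # Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
    box: Tuple[float, float, float, float]


@dataclass
class HazardRecord:
    """One image with its annotation tuple: speed, hazard description, entities."""
    image_path: str          # hypothetical file name
    speed_kmh: float         # unit is an assumption; the abstract only says "car speed"
    hazard_description: str  # hypothesized hazard written by an annotator
    entities: List[VisualEntity] = field(default_factory=list)


def entity_labels(record: HazardRecord) -> List[str]:
    """Return the labels of all entities attached to a record, in order."""
    return [entity.label for entity in record.entities]


# Illustrative values only; they are not taken from the dataset.
example = HazardRecord(
    image_path="example_scene.jpg",
    speed_kmh=40.0,
    hazard_description=(
        "The pedestrian may be flagging down the taxi, "
        "which may stop abruptly in front of our car."
    ),
    entities=[
        VisualEntity("pedestrian", (120.0, 200.0, 180.0, 340.0)),
        VisualEntity("taxi", (300.0, 220.0, 520.0, 380.0)),
    ],
)

print(entity_labels(example))
```

Keeping the entities as a list separate from the free-text description mirrors the abstract's wording, where the description and the visual entities are distinct parts of the tuple.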

Paper Structure (23 sections, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Example of driving hazard prediction from a single dashcam image. The pedestrian in the green box may be attempting to flag down a taxi, and the taxi may abruptly stop in front of our car to offer them a ride.
  • Figure 2: Illustration of the DHPR dataset with example annotations (left) and hazard explanations retrieved and generated by our model and GPT-4V (right).
  • Figure 3: The proposed method for the retrieval and generation tasks.
  • Figure 4: Examples of hazard explanations generated by our baseline model and GPT-4V.