Toward Automatic Safe Driving Instruction: A Large-Scale Vision Language Model Approach

Haruki Sakajo, Hiroshi Takato, Hiroshi Tsutsui, Komei Soda, Hidetaka Kamigaito, Taro Watanabe

TL;DR

This work addresses the automatic generation of safe driving instructions using large-scale Vision Language Models (LVLMs) with synchronized driver-facing and road-facing video inputs. It builds a two-question, conversation-style dataset annotated through a three-step process (primary/sub-event selection and GPT-4o-generated summaries) with expert verification, and evaluates LVLMs in both zero-shot and fine-tuned regimes. The study shows that fine-tuned LVLMs can produce accurate, safety-oriented driving guidance, although detecting subtle or complex events remains challenging, and its error analysis reveals biases and limitations in current models. Overall, the results demonstrate the feasibility of LVLM-based driving instruction while outlining concrete improvements needed for robust real-world deployment and safer autonomous driving systems.

Abstract

Large-scale Vision Language Models (LVLMs) exhibit advanced capabilities in tasks that require visual information, including object detection. These capabilities have promising applications in various industrial domains, such as autonomous driving. For example, LVLMs can generate safety-oriented descriptions of videos captured by road-facing cameras. However, ensuring comprehensive safety also requires monitoring driver-facing views to detect risky events, such as mobile phone use while driving. Thus, the ability to process synchronized inputs from both driver-facing and road-facing cameras is necessary. In this study, we develop models and investigate the capabilities of LVLMs by constructing a dataset and evaluating their performance on it. Our experimental results demonstrate that while pre-trained LVLMs have limited effectiveness, fine-tuned LVLMs can generate accurate and safety-aware driving instructions. Nonetheless, several challenges remain, particularly in detecting subtle or complex events in the video. Our findings and error analysis provide valuable insights that can contribute to the improvement of LVLM-based systems in this domain.

Paper Structure

This paper contains 32 sections, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Illustration of an application of this study. A model provides driving instructions for the given video.
  • Figure 2: Overview of our dataset construction approach. The dataset contains videos with synchronized driver-facing and road-facing views. GPT-4o generates the gold answers to event detection and safe driving instruction questions based on annotated labels for each video.
  • Figure 3: Score distribution of BERTScore F1 of event detection.
  • Figure 4: Score distribution of BLEU of event detection.
  • Figure 5: Score distribution of BERTScore F1 of safe driving instruction.
  • ...and 1 more figures
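
Figures 3 to 5 report score distributions for BERTScore F1 and BLEU. As a rough illustration of what such a distribution summarizes, the sketch below bins per-example scores into equal-width intervals. It assumes scores are normalized to [0, 1]; the function name `score_histogram`, the bin count, and the input values are hypothetical placeholders and are not taken from the paper.

```python
def score_histogram(scores, num_bins=10):
    """Count scores in [0, 1] into equal-width bins and return the counts."""
    if num_bins <= 0:
        raise ValueError("num_bins must be positive")
    counts = [0] * num_bins
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range: {s}")
        # A score of exactly 1.0 is placed in the last bin.
        idx = min(int(s * num_bins), num_bins - 1)
        counts[idx] += 1
    return counts


# Placeholder values for illustration only; these are not results from the paper.
example_scores = [0.05, 0.15, 0.15, 0.55, 0.95, 1.0]
hist = score_histogram(example_scores, num_bins=5)
for i, count in enumerate(hist):
    low, high = i / 5, (i + 1) / 5
    print(f"[{low:.1f}, {high:.1f}): {'#' * count}")
```

Metrics reported on a different scale (for example, BLEU on 0 to 100) would need to be rescaled before being passed to this function.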