Automating eHMI Action Design with LLMs for Automated Vehicle Communication

Ding Xia, Xinyue Gui, Fan Gao, Dongyuan Li, Mark Colley, Takeo Igarashi

TL;DR

This work addresses the lack of explicit communication between automated vehicles and other road users by introducing an LLM-driven framework for designing executable eHMI actions. It presents a two-step LLM-Blender pipeline that generates action sequences and renders them as video clips, along with an Action-Design Scoring Dataset of 320 clips for benchmarking action design against human judgments. Two automated evaluators, an Action Reference Score based on dynamic time warping (DTW) and a Vision-Language Model rater, enable scalable benchmarking across 18 LLMs. The findings show that pretrained LLMs approach human-level performance, with reasoning-enabled models giving the strongest results. This suggests a scalable, adaptable approach to eHMI design that may extend to other human–robot communication domains.

Abstract

The absence of explicit communication channels between automated vehicles (AVs) and other road users requires the use of external Human-Machine Interfaces (eHMIs) to convey messages effectively in uncertain scenarios. Currently, most eHMI studies employ predefined text messages and manually designed actions to perform these messages, which limits the real-world deployment of eHMIs, where adaptability in dynamic scenarios is essential. Given the generalizability and versatility of large language models (LLMs), they could potentially serve as automated action designers for the message-action design task. To validate this idea, we make three contributions: (1) We propose a pipeline that integrates LLMs and 3D renderers, using LLMs as action designers to generate executable actions for controlling eHMIs and rendering action clips. (2) We collect a user-rated Action-Design Scoring dataset comprising 320 action sequences for eight intended messages and four representative eHMI modalities. The dataset validates that LLMs, particularly reasoning-enabled ones, can translate intended messages into actions at close to a human level. (3) We introduce two automated raters, Action Reference Score (ARS) and Vision-Language Models (VLMs), to benchmark 18 LLMs, finding that the VLM rater aligns with human preferences, although the degree of alignment varies across eHMI modalities.
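
The TL;DR describes the Action Reference Score as based on DTW (dynamic time warping). This page does not give the exact ARS definition, so the following is only a minimal sketch of the textbook DTW distance between two one-dimensional sequences. The function name `dtw_distance` and the idea of treating an action as a list of per-keyframe values (for example, an arm angle) are assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic dynamic time warping distance between two 1-D sequences.

    Hypothetical helper for illustration only: the paper's ARS may use
    different features, local costs, or normalization.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")

    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # local distance between samples
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b[j-1] is reused
                cost[i][j - 1],      # b advances, a[i-1] is reused
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


# Assumed example: a reference action and a candidate that holds one
# keyframe longer. DTW absorbs the timing difference.
reference = [0.0, 1.0, 2.0]
candidate = [0.0, 1.0, 1.0, 2.0]
print(dtw_distance(reference, candidate))
```

Because DTW allows one sample to match several consecutive samples in the other sequence, two actions that follow the same trajectory at different speeds get a small distance, which is a plausible reason to use it when comparing generated actions against reference designs of different lengths.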

Paper Structure

This paper contains 30 sections, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Setup illustration and action demos. a) Four types of eHMIs are installed on the vehicle separately; b) Demo actions of the arm convey the message: "Say Hello". The shaded action indicates the subsequent status; c) Demo actions of the eye: "Help me out".
  • Figure 2: Dataset Asset, Pipeline, and Human Scoring. Dataset assets contain four representative eHMIs and eight intended messages from different interaction types. In the pipeline, we develop eight corresponding Blender scenarios and render actions designed by LLMs or human experts to clips. During the human scoring phase, ten participants evaluate each action clip using a five-point Likert scale.
  • Figure 3: Relationship between action clip length and evaluation scores. The plot compares scores from human raters and the VLM rater (Qwen-QvQ-Max).
  • Figure 4: Comparative distribution of Action-Design Scoring, where each action clip is rated on a 5-point Likert scale. Human designers most frequently receive a score of 5 (Strongly Agree), while GPT-o1 receives the highest number of 4 (Agree) scores.
  • Figure 5: Case study of the Action-Design Scoring dataset. For a clearer demonstration, we present the images shown to VLM raters. Cases (a) and (b) demonstrate that LLMs tend to include expressions of gratitude, which are unnecessary and create confusion. Case (c) illustrates an unclear attempt to convey that "the pedestrian is coming from the right". Case (d) is a perfect demonstration of human design, focusing only on the important information and ignoring the information that "a bus is coming from the left".
  • ...and 9 more figures