Automating eHMI Action Design with LLMs for Automated Vehicle Communication
Ding Xia, Xinyue Gui, Fan Gao, Dongyuan Li, Mark Colley, Takeo Igarashi
TL;DR
This work addresses the lack of explicit communication between automated vehicles and other road users by introducing an LLM-driven framework for designing executable eHMI actions. It presents a two-step LLM-Blender pipeline that generates action sequences and renders them as video clips, along with an Action-Design Scoring Dataset of 320 clips for benchmarking action design against human judgments. Two automated evaluators, an Action Reference Score based on dynamic time warping (DTW) and a Vision-Language Model rater, enable scalable benchmarking across 18 LLMs. The findings show that pretrained LLMs approach human-level performance, with reasoning-enabled models achieving the strongest results. This suggests a scalable, adaptable approach to eHMI design that could extend to other human–robot communication domains.
Abstract
The absence of explicit communication channels between automated vehicles (AVs) and other road users requires the use of external Human-Machine Interfaces (eHMIs) to convey messages effectively in uncertain scenarios. Currently, most eHMI studies rely on predefined text messages and manually designed actions to convey them. This limits the real-world deployment of eHMIs, where adaptability to dynamic scenarios is essential. Given their generalizability and versatility, large language models (LLMs) could serve as automated action designers for this message-action design task. To validate this idea, we make three contributions: (1) We propose a pipeline that integrates LLMs and 3D renderers, using LLMs as action designers to generate executable actions that control eHMIs and are rendered as action clips. (2) We collect a user-rated Action-Design Scoring dataset comprising 320 action sequences for eight intended messages and four representative eHMI modalities. The dataset shows that LLMs, particularly reasoning-enabled ones, can translate intended messages into actions at close to a human level. (3) We introduce two automated raters, Action Reference Score (ARS) and Vision-Language Models (VLMs), to benchmark 18 LLMs, finding that the VLM rater aligns with human preferences, although this alignment varies across eHMI modalities.
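The TL;DR describes the Action Reference Score as based on DTW, but this summary does not give its exact formulation. As a rough illustration only, the sketch below computes a plain dynamic time warping distance between two action sequences. It assumes each sequence is a list of scalar values (for example, a light intensity sampled over time) and uses absolute difference as the per-step cost. The function name `dtw_distance` and this representation are hypothetical and are not taken from the paper.

```python
from math import inf


def dtw_distance(a, b):
    """Classic DTW distance between two 1-D numeric sequences.

    Assumption: each action sequence is reduced to a list of scalars and
    the per-step cost is the absolute difference. This is not the
    authors' ARS, only the DTW building block it is said to rely on.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match
            cost[i][j] = step + min(
                cost[i - 1][j],
                cost[i][j - 1],
                cost[i - 1][j - 1],
            )
    return cost[n][m]


# A generated sequence that merely holds one value longer than the
# reference aligns to it at zero cost.
reference = [0, 1, 2]
generated = [0, 1, 1, 2]
distance = dtw_distance(reference, generated)
```

DTW tolerates differences in timing between a generated and a reference sequence, which is one plausible reason to prefer it over a pointwise comparison when the two sequences have different lengths.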
