"What's Happening"- A Human-centered Multimodal Interpreter Explaining the Actions of Autonomous Vehicles
Xuewen Luo, Fan Ding, Ruiqi Chen, Rishikesh Panda, Junnyong Loo, Shuyun Zhang
TL;DR
The paper addresses public distrust in autonomous vehicles (AVs) by proposing a Human-centered Multimodal Interpreter (HMI) that delivers real-time explanations through a visual interface (Bird's Eye View (BEV), map, and text) and a voice channel powered by a prompt-engineered LLM. It integrates multimodal feedback to tailor explanations to user preferences, aiming to improve transparency, perceived safety, and trust. A user study shows that the HMI significantly boosts passenger trust across ordinary, low-visibility, and emergency scenarios, raising average trust by over 8% and trust in ordinary environments by up to 30%. The work suggests that multimodal, passenger-facing explanations can improve the acceptance and reliability of AVs and points future designs toward more transparent, user-centric interfaces.
Abstract
Public distrust of self-driving cars is growing. Studies emphasize the need for interpreting the behavior of these vehicles to passengers to promote trust in autonomous systems. Interpreters can enhance trust by improving transparency and reducing perceived risk. However, current solutions often lack a human-centric approach to integrating multimodal interpretations. This paper introduces a novel Human-centered Multimodal Interpreter (HMI) system that leverages human preferences to provide visual, textual, and auditory feedback. The system combines a visual interface with Bird's Eye View (BEV), map, and text display, along with voice interaction using a fine-tuned large language model (LLM). Our user study, involving diverse participants, demonstrated that the HMI system significantly boosts passenger trust in autonomous vehicles (AVs), increasing average trust levels by over 8%, with trust in ordinary environments rising by up to 30%. These results underscore the potential of the HMI system to improve the acceptance and reliability of autonomous vehicles by providing clear, real-time, and context-sensitive explanations of vehicle actions.
