Trust Through Transparency: Explainable Social Navigation for Autonomous Mobile Robots via Vision-Language Models
Oluwadamilola Sotomi, Devika Kodi, Aliasghar Arab
TL;DR
This work addresses explainability in social navigation for autonomous mobile robots by proposing a multimodal explainability module that fuses vision-language foundation models, Grad-CAM heatmaps, and large language models (LLMs) to generate real-time, human-readable rationales for navigation decisions. Implemented as a ROS2-based four-node architecture, the module integrates camera perception, heatmap analysis, captioning, and LLM-driven explanations within an autonomous navigation stack. A user study with $N=30$ participants indicates that real-time explanations increase trust and perceived understanding, and confusion matrices quantify alignment with human expectations. Latency on resource-constrained hardware remains a bottleneck, motivating future work on optimization and edge/distributed computing to deliver timely explanations in dynamic environments.
Abstract
Service and assistive robots are increasingly deployed in dynamic social environments; however, ensuring transparent and explainable interactions remains a significant challenge. This paper presents a multimodal explainability module that integrates vision-language models and heatmaps to improve transparency during navigation. The proposed system enables robots to perceive, analyze, and articulate their observations through natural-language summaries. A user study (n=30) showed that a majority of participants preferred real-time explanations, indicating improved trust and understanding. Confusion matrix analysis was used to assess how closely the system's outputs agreed with human expectations. Our experimental and simulation results highlight the effectiveness of explainability in autonomous navigation for enhancing trust and interpretability.
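
To make the confusion matrix analysis concrete, the following is a minimal sketch of how agreement between human expectations and a robot's explained decisions could be tabulated. The label set (`stop`, `slow_down`, `proceed`), the function names, and the sample data are all hypothetical and chosen for illustration only; they are not the categories or results of the study.

```python
from collections import Counter

def confusion_matrix(expected, predicted, labels):
    """Rows index the human-expected label, columns the robot's label."""
    counts = Counter(zip(expected, predicted))
    return [[counts[(e, p)] for p in labels] for e in labels]

def agreement_rate(matrix):
    """Fraction of cases on the diagonal, i.e. where both labels match."""
    total = sum(sum(row) for row in matrix)
    matched = sum(matrix[i][i] for i in range(len(matrix)))
    return matched / total if total else 0.0

# Hypothetical label set and toy data (assumptions, not study data).
labels = ["stop", "slow_down", "proceed"]
human_expected = ["stop", "stop", "slow_down", "proceed", "proceed", "slow_down"]
robot_explained = ["stop", "slow_down", "slow_down", "proceed", "proceed", "proceed"]

matrix = confusion_matrix(human_expected, robot_explained, labels)
rate = agreement_rate(matrix)

for label, row in zip(labels, matrix):
    print(f"{label:>10}: {row}")
print(f"agreement rate: {rate:.2f}")
```

Off-diagonal cells show where the robot's stated decision diverged from what participants expected, which is often more informative than the overall agreement rate alone.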
