Trust Through Transparency: Explainable Social Navigation for Autonomous Mobile Robots via Vision-Language Models

Oluwadamilola Sotomi, Devika Kodi, Aliasghar Arab

TL;DR

This work tackles the challenge of explainability in social navigation for autonomous mobile robots by proposing a multimodal explainability module that fuses Vision-Language Foundation Models, Grad-CAM heatmaps, and large language models (LLMs) to generate real-time, human-readable rationales for navigation decisions. Implemented as a ROS2-based four-node architecture, the module integrates camera perception, heatmap analysis, captioning, and LLM-driven explanations within an autonomous navigation stack. A user study with $N=30$ participants demonstrates that real-time explanations increase trust and perceived understanding, with quantitative alignment to human expectations assessed via confusion matrices. Latency on resource-constrained hardware remains a bottleneck, motivating future work on latency optimization and edge/distributed computing to deliver timely explanations in dynamic environments.
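The four-stage flow described above (camera perception, heatmap analysis, captioning, LLM-driven explanation) can be sketched as a chain of plain Python functions. This is only an illustrative sketch, not the authors' ROS2 implementation: the names `camera_node`, `heatmap_node`, `caption_node`, `explanation_node`, and `run_pipeline` are hypothetical, the heatmap step is a simple normalisation standing in for Grad-CAM, and the captioning and explanation steps are string templates standing in for the vision-language model and the LLM.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    """Hypothetical camera frame: a 2D grid of intensities in [0, 1]."""
    pixels: List[List[float]]


def camera_node(raw: List[List[float]]) -> Frame:
    # Stage 1: wrap raw sensor data (stand-in for a camera subscriber).
    return Frame(raw)


def heatmap_node(frame: Frame) -> List[List[float]]:
    # Stage 2: stand-in for Grad-CAM. Scales values so the peak is 1.0.
    peak = max(max(row) for row in frame.pixels) or 1.0
    return [[value / peak for value in row] for row in frame.pixels]


def caption_node(heatmap: List[List[float]]) -> str:
    # Stage 3: stand-in for VLM captioning. Reports which image half
    # carries more total saliency.
    half = len(heatmap[0]) // 2
    left = sum(sum(row[:half]) for row in heatmap)
    right = sum(sum(row[half:]) for row in heatmap)
    side = "left" if left > right else "right"
    return f"salient region on the {side}"


def explanation_node(caption: str, action: str) -> str:
    # Stage 4: stand-in for the LLM. Fills a fixed sentence template.
    return f"I am {action} because I see a {caption}."


def run_pipeline(raw: List[List[float]], action: str) -> str:
    # Chain the four stages in the order the text lists them.
    frame = camera_node(raw)
    heatmap = heatmap_node(frame)
    caption = caption_node(heatmap)
    return explanation_node(caption, action)


if __name__ == "__main__":
    print(run_pipeline([[0.0, 0.9], [0.1, 0.8]], "re-planning"))
```

In a ROS2 deployment each function would presumably become its own node exchanging messages over topics, which is where the latency mentioned above would accumulate; that mapping is an assumption and is not shown here.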

Abstract

Service and assistive robots are increasingly deployed in dynamic social environments; however, ensuring transparent and explainable interactions remains a significant challenge. This paper presents a multimodal explainability module that integrates vision-language models and heat maps to improve transparency during navigation. The proposed system enables robots to perceive, analyze, and articulate their observations through natural-language summaries. In a user study (n=30), a majority of participants preferred real-time explanations, indicating improved trust and understanding. We validated our experiments through confusion matrix analysis, which assessed the level of agreement with human expectations. Our experimental and simulation results underscore the effectiveness of explainability in enhancing trust and interpretability in autonomous navigation.
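As a rough illustration of how a confusion matrix can quantify agreement with human expectations, the sketch below tallies expected versus produced labels and reports the fraction that match. The label names (`yield`, `proceed`) and the sample data are invented for illustration and are not taken from the paper; the helper names `confusion_matrix` and `agreement` are likewise hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def confusion_matrix(expected: Sequence[str],
                     predicted: Sequence[str],
                     labels: Sequence[str]) -> List[List[int]]:
    # Rows index the human-expected label, columns the system's label.
    counts = Counter(zip(expected, predicted))
    return [[counts[(e, p)] for p in labels] for e in labels]


def agreement(matrix: List[List[int]]) -> float:
    # Fraction of cases on the diagonal, i.e. where both labels match.
    total = sum(sum(row) for row in matrix)
    matched = sum(matrix[i][i] for i in range(len(matrix)))
    return matched / total if total else 0.0


if __name__ == "__main__":
    # Toy data only: what a person expected vs. what the robot reported.
    labels = ["yield", "proceed"]
    expected = ["yield", "yield", "proceed", "proceed"]
    predicted = ["yield", "proceed", "proceed", "proceed"]

    matrix = confusion_matrix(expected, predicted, labels)
    print(matrix)
    print(agreement(matrix))
```

Off-diagonal cells show which kinds of mismatch occur (for example, the robot proceeding where a person expected it to yield), which a single agreement score would hide.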

Paper Structure

This paper contains 14 sections, 13 equations, 4 figures, 3 tables, 4 algorithms.

Figures (4)

  • Figure 1: AMR approaches a social setting, demonstrating real-time explainable re-planning to avoid interrupting human interaction.
  • Figure 2: A diagram showing the relationship between the nodes that make up the explainability module.
  • Figure 3: Test 1: User survey results.
  • Figure 4: Test 2: User survey results.