Enhancing Explainability with Multimodal Context Representations for Smarter Robots

Anargh Viswanath; Lokesh Veeramacheneni; Hendrik Buschmeier

TL;DR

This work addresses the explainability gap in multimodal Human-Robot Interaction (HRI) by proposing a generalized context-representation framework that fuses speech and vision. The methodology has two modules: a Multimodal Joint Representation module, which learns a shared speech-visual embedding, and a Temporal Alignment module, which synchronizes the two modalities over time and derives a relevance score between verbal utterances and visual scenes. Explainability is addressed in two ways: through useful representations in a coherent embedding space, and through multi-level abstractions for the robot, the user, and the developer, which support grounding, intent recognition, and clearer debugging. The approach aims to improve trust and collaboration in human-centered spaces by making robot decisions transparent and context-aware, and by giving each stakeholder insight into those decisions.

Abstract

Artificial Intelligence (AI) has advanced significantly in recent years, driving innovation across various fields, especially robotics. Although robots can perform complex tasks with increasing autonomy, challenges remain in ensuring explainability and user-centered design for effective interaction. A key issue in Human-Robot Interaction (HRI) is enabling robots to perceive and reason over multimodal inputs, such as audio and vision, in a way that fosters trust and seamless collaboration. In this paper, we propose a generalized and explainable multimodal framework for context representation, designed to improve the fusion of speech and vision modalities. We introduce a use case on assessing 'Relevance' between the user's verbal utterances and the robot's perception of the visual scene. We present our methodology, which consists of a Multimodal Joint Representation module and a Temporal Alignment module and allows robots to evaluate relevance by temporally aligning multimodal inputs. Finally, we discuss how the proposed framework for context representation can help with various aspects of explainability in HRI.
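
To make the 'Relevance' use case concrete, the following is a minimal sketch of how a relevance score could be computed once speech and vision are embedded in a shared space. It is not the authors' implementation: the nearest-timestamp pairing, the cosine-similarity scoring, the averaging step, and all names (`align`, `relevance`, `max_gap`) are illustrative assumptions, and the toy 2-D vectors stand in for embeddings that a learned joint-representation model would produce.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def align(speech, frames, max_gap=0.5):
    """Assumed alignment rule: pair each timestamped speech embedding with the
    visual frame closest in time, dropping pairs further apart than max_gap seconds."""
    if not frames:
        return []
    pairs = []
    for t_s, e_s in speech:
        t_f, e_f = min(frames, key=lambda f: abs(f[0] - t_s))
        if abs(t_f - t_s) <= max_gap:
            pairs.append((e_s, e_f))
    return pairs

def relevance(speech, frames, max_gap=0.5):
    """Assumed score: mean cosine similarity over the aligned pairs (0.0 if none)."""
    pairs = align(speech, frames, max_gap)
    if not pairs:
        return 0.0
    return sum(cosine(e_s, e_f) for e_s, e_f in pairs) / len(pairs)

# Toy (timestamp_seconds, embedding) data; both modalities are assumed to
# already live in the same embedding space.
speech = [(0.0, [1.0, 0.0]), (1.0, [0.0, 1.0])]
frames = [(0.1, [1.0, 0.0]), (0.9, [0.0, 1.0]), (2.0, [1.0, 1.0])]
score = relevance(speech, frames)
```

In this sketch the per-pair similarities also serve as an inspectable intermediate: a developer could log them to see which utterance segments matched which frames, in the spirit of the multi-level explanations described above.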
