Multimodal Human-Autonomous Agents Interaction Using Pre-Trained Language and Visual Foundation Models

Linus Nwankwo, Elmar Rueckert

TL;DR

The paper addresses natural and robust interaction between humans and autonomous agents in real-world environments. It introduces a dual-modality pipeline that combines pre-trained large language models (LLMs), multimodal visual language models (VLMs), and speech recognition (SR) to interpret vocal and textual input and map it to robot actions through a ROS-based execution layer. In the reported evaluation, vocal commands were understood with 87.55% accuracy and executed with 86.27% success, at an average latency of about 0.89 seconds, while text-based interaction achieved higher nominal accuracy. The approach is reported to be robust to accent variation and environmental noise because users can switch between modalities. The work advances practical HRI by making interaction more intuitive and points toward noise-robust, context-aware human-robot collaboration in real-world settings.

Abstract

In this paper, we extend the method proposed in [21] to enable humans to interact naturally with autonomous agents through vocal and textual conversations. Our extended method exploits the inherent capabilities of pre-trained large language models (LLMs), multimodal visual language models (VLMs), and speech recognition (SR) models to decode high-level natural language conversations, together with a semantic understanding of the robot's task environment, and abstract them into actionable commands or queries for the robot. We performed a quantitative evaluation of our framework's understanding of natural vocal conversation with participants from different racial backgrounds and with different English accents. The participants interacted with the robot using both spoken and textual instructional commands. Based on the logged interaction data, our framework achieved 87.55% vocal command decoding accuracy, 86.27% command execution success, and an average latency of 0.89 seconds from receiving a participant's vocal chat command to initiating the robot's physical action. The video demonstrations of this paper can be found at https://linusnep.github.io/MTCC-IRoNL/.
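The dual-modality idea in the abstract can be illustrated with a minimal sketch: spoken input is first transcribed to text, after which voice and text share one decoding path. Everything named here is an assumption made for illustration and is not taken from the paper: `RobotCommand`, `decode_text`, `handle_input`, and the injected `transcribe` callable are hypothetical, and the keyword matcher merely stands in for the LLM step.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass
class RobotCommand:
    """Hypothetical structured command handed to an execution layer."""
    action: str            # e.g. "navigate", "describe_scene", "stop"
    target: Optional[str]  # e.g. a named location, or None


def decode_text(utterance: str) -> RobotCommand:
    """Keyword stand-in for the LLM step (assumption: a real system queries a model)."""
    text = utterance.lower().strip().rstrip(".!?")
    if text.startswith("go to "):
        return RobotCommand("navigate", text[len("go to "):].strip())
    if "what do you see" in text:
        return RobotCommand("describe_scene", None)
    if "stop" in text.split():
        return RobotCommand("stop", None)
    return RobotCommand("unknown", None)


def handle_input(
    payload: Union[str, bytes],
    modality: str,
    transcribe: Callable[[bytes], str],
) -> RobotCommand:
    """Route voice through speech-to-text, then reuse the text path for both modalities."""
    if modality == "voice":
        utterance = transcribe(payload)  # hypothetical speech-recognition callable
    elif modality == "text":
        utterance = payload
    else:
        raise ValueError(f"unsupported modality: {modality!r}")
    return decode_text(utterance)


if __name__ == "__main__":
    # A fake transcriber replaces a real speech-recognition model in this sketch.
    fake_stt = lambda audio: "Go to the kitchen."
    print(handle_input(b"\x00\x01", "voice", fake_stt))
    print(handle_input("What do you see?", "text", fake_stt))
```

Passing the transcriber in as a parameter keeps the sketch independent of any particular speech-recognition service, and it makes the fallback from voice to text a one-argument change.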

Paper Structure

This paper contains 7 sections, 1 equation, and 2 figures.

Figures (2)

  • Figure 1: Overview of our framework's architecture. The area enclosed by the red dotted line decodes the text-based natural language conversations and handles visual understanding. In the SRNode, we employed Google's SR model stt to decode vocal conversations from humans and abstract them into the textual representations required by the ChatGUI to interact with the LLMNode.
  • Figure 2: Quantitative evaluation results illustrating VCUA, NSR, OIA, and ART based on the logged interaction data.
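As a rough illustration of how summary figures such as decoding accuracy, execution success, and average latency could be derived from logged interactions, the sketch below aggregates per-command records. The record layout (`InteractionLog` and its fields) and the `summarize` helper are assumptions made for this example. The paper's actual log format and the exact definitions of VCUA, NSR, OIA, and ART are not given in this summary, so the sample values are invented purely to exercise the code.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class InteractionLog:
    """Hypothetical record of one logged command (fields are assumptions)."""
    decoded_correctly: bool      # was the command understood as intended?
    executed_successfully: bool  # did the robot complete the requested action?
    latency_s: float             # seconds from receiving the command to starting the action


def summarize(logs: List[InteractionLog]) -> Dict[str, float]:
    """Compute percentage rates and mean latency over a non-empty list of logs."""
    if not logs:
        raise ValueError("at least one logged interaction is required")
    n = len(logs)
    return {
        "decoding_accuracy_pct": 100.0 * sum(log.decoded_correctly for log in logs) / n,
        "execution_success_pct": 100.0 * sum(log.executed_successfully for log in logs) / n,
        "mean_latency_s": mean(log.latency_s for log in logs),
    }


if __name__ == "__main__":
    # Invented sample data, not results from the paper.
    sample = [
        InteractionLog(True, True, 0.5),
        InteractionLog(True, False, 1.0),
        InteractionLog(True, True, 1.5),
        InteractionLog(False, False, 1.0),
    ]
    print(summarize(sample))
```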