The Conversation is the Command: Interacting with Real-World Autonomous Robot Through Natural Language

Linus Nwankwo, Elmar Rueckert

TL;DR

This paper introduces an approach that synergistically exploits the capabilities of large language models (LLMs) and multimodal vision-language models (VLMs) to enable humans to interact naturally with autonomous robots through conversational dialogue.

Abstract

In recent years, autonomous agents have become increasingly common in real-world environments such as our homes, offices, and public spaces. However, natural human-robot interaction remains a key challenge. In this paper, we introduce an approach that synergistically exploits the capabilities of large language models (LLMs) and multimodal vision-language models (VLMs) to enable humans to interact naturally with autonomous robots through conversational dialogue. We leverage the LLMs to decode high-level natural language instructions from humans and abstract them into precise, robot-actionable commands or queries. Further, we utilise the VLMs to provide a visual and semantic understanding of the robot's task environment. Our results, with 99.13% command recognition accuracy and 97.96% command execution success, show that our approach can enhance human-robot interaction in real-world applications. The video demonstrations of this paper can be found at https://osf.io/wzyf6 and the code is available at our GitHub repository (https://github.com/LinusNEP/TCC_IRoNL.git).

Paper Structure

This paper contains 9 sections, 3 equations, and 3 figures.

Figures (3)

  • Figure 1: Example demonstration of our framework. We demonstrated these results in the real world, as shown in the summary video at https://osf.io/wzyf6. In (a), our framework decodes high-level instructions from humans, such as "move in a circular pattern", "move forward", and "go right", and abstracts them to the robot's physical actions. In (b), we leveraged our framework for the robot's task environment understanding, information requests, and goal navigation.
  • Figure 2: Overview of our framework's architecture. The LLMNode decodes the natural language conversations. The CLIPNode provides a visual and semantic understanding of the robot's task environment. The REM node abstracts the high-level understanding from the LLMNode to actual robot actions. The ChatGUI serves as the user's primary interaction point. See the paper's subsections on the LLMNode, the REM node, and the ChatGUI for more details.
  • Figure 3: Performance and variability measures illustrating CRA, OIA, and NSR (top) and the participants' feedback (bottom) based on the logged interaction data.
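
The flow described for Figure 2 (an instruction is decoded into a structured command, which is then turned into a robot action) can be illustrated with a minimal sketch. This is not the authors' implementation: `RobotCommand`, `decode_instruction`, and `execute` are hypothetical names, the keyword rules are a stand-in for the LLM, and the velocity values are arbitrary placeholders. The example phrases are the ones quoted in the Figure 1 caption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class RobotCommand:
    """Hypothetical structured command, not the paper's message type."""
    action: str            # "move", "turn", or "circle"
    linear: float = 0.0    # forward velocity in m/s (placeholder value)
    angular: float = 0.0   # yaw rate in rad/s (placeholder value)


def decode_instruction(text: str) -> Optional[RobotCommand]:
    """Stand-in for the LLMNode: keyword rules instead of an LLM call."""
    t = text.lower()
    if "circular" in t or "circle" in t:
        return RobotCommand("circle", linear=0.2, angular=0.5)
    if "forward" in t:
        return RobotCommand("move", linear=0.2)
    if "right" in t:
        return RobotCommand("turn", angular=-0.5)
    if "left" in t:
        return RobotCommand("turn", angular=0.5)
    return None  # instruction not recognised


def execute(cmd: RobotCommand) -> Tuple[float, float]:
    """Stand-in for the REM node: returns the (linear, angular) velocity
    pair that a real node would send to the robot base."""
    return (cmd.linear, cmd.angular)


if __name__ == "__main__":
    for phrase in ["move in a circular pattern", "move forward", "go right"]:
        cmd = decode_instruction(phrase)
        print(phrase, "->", cmd, "->", execute(cmd))
```

Keeping the decoded command as a small typed record separates language understanding from actuation, so the keyword rules here could be swapped for an actual LLM call without changing the execution side.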