LaMI: Large Language Models for Multi-Modal Human-Robot Interaction

Chao Wang; Stephan Hasler; Daniel Tanneberg; Felix Ocker; Frank Joublin; Antonello Ceravola; Joerg Deigmoeller; Michael Gienger

TL;DR

The paper addresses the bottleneck of manual state-and-flow design in multi-modal HRI by proposing an LLM-driven framework composed of Scene Narrator, Planner, and Expresser. It demonstrates integration on a physical robot, translating multi-modal inputs into natural-language reasoning and coordinating speech and expressive gestures through a GPT-4 Tool API–driven pipeline. The approach couples high-level guidance with atomic actions, atomic motion clips, and embedded examples, while employing rule-based reactive expressions to mitigate latency and maintain responsiveness. Preliminary results show the method can adapt to dynamic scenes and convey rich social interaction, with future work planned to benchmark against rule-based baselines and to study anthropomorphism and workload through user studies.

Abstract

This paper presents an innovative large language model (LLM)-based robotic system for enhancing multi-modal human-robot interaction (HRI). Traditional HRI systems relied on complex designs for intent estimation, reasoning, and behavior generation, which were resource-intensive. In contrast, our system empowers researchers and practitioners to regulate robot behavior through three key aspects: providing high-level linguistic guidance, creating "atomic actions" and expressions the robot can use, and offering a set of examples. Implemented on a physical robot, it demonstrates proficiency in adapting to multi-modal inputs and determining the appropriate manner of action to assist humans with its arms, following researchers' defined guidelines. Simultaneously, it coordinates the robot's lid, neck, and ear movements with speech output to produce dynamic, multi-modal expressions. This showcases the system's potential to revolutionize HRI by shifting from conventional, manual state-and-flow design methods to an intuitive, guidance-based, and example-driven approach. Supplementary material can be found at https://hri-eu.github.io/Lami/
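The three levers named in the abstract (high-level guidance, "atomic actions", and examples) can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the guidance string, the `ActionRegistry` class, the `speak` and `hand_over` actions, and the example plan are all hypothetical, and the LLM call is replaced by a hard-coded list of function calls so that the block runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Lever 1: high-level linguistic guidance (hypothetical wording, not the paper's prompt).
GUIDANCE = "Help the person only when they cannot complete the task themselves."


@dataclass
class ActionRegistry:
    """Lever 2: maps atomic-action names to the callables that execute them."""
    actions: Dict[str, Callable[..., str]] = field(default_factory=dict)

    def register(self, fn: Callable[..., str]) -> Callable[..., str]:
        # Used as a decorator; the function name becomes the callable's tool name.
        self.actions[fn.__name__] = fn
        return fn

    def dispatch(self, call: dict) -> str:
        # Execute one function call of the form {"name": ..., "arguments": {...}}.
        name = call["name"]
        if name not in self.actions:
            return f"error: unknown action '{name}'"
        return self.actions[name](**call.get("arguments", {}))


registry = ActionRegistry()


@registry.register
def speak(text: str) -> str:
    # Hypothetical atomic action: speech output.
    return f"robot says: {text}"


@registry.register
def hand_over(object_name: str, person: str) -> str:
    # Hypothetical atomic action: pass an object to a person.
    return f"robot hands {object_name} to {person}"


# Lever 3: an example plan a designer might supply. Here it also stands in
# for the function calls an LLM would return (assumption for this sketch).
EXAMPLE_PLAN: List[dict] = [
    {"name": "speak", "arguments": {"text": "Here you go."}},
    {"name": "hand_over", "arguments": {"object_name": "cup", "person": "person_1"}},
]


def run_plan(plan: List[dict]) -> List[str]:
    """Dispatch each call in order and collect the textual outcomes."""
    return [registry.dispatch(call) for call in plan]


if __name__ == "__main__":
    for outcome in run_plan(EXAMPLE_PLAN):
        print(outcome)
```

In a sketch like this, new robot behavior is added by registering another atomic action or editing the guidance text rather than by redesigning a state machine, which is the shift the abstract describes.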

Paper Structure (16 sections, 5 figures)

Introduction
LLM driven Human-Robot Interaction
The "Scene Narrator"
The "Planner"
The "Expresser"
Interaction Flow: An Example
Configuration space for human-robot interaction
Evaluation Setup
Test Scenario
Preliminary test result and lesson learned
Conclusions and Future work
Guidance and Function Descriptions
System Prompt
Some of the Callable functions and Descriptions
Examples of robot facial expression
...and 1 more sections

Figures (5)

Figure 1: Robot's Hardware and the Scenario Setup
Figure 2: The system structure.
Figure 3: The GUI illustrates the robot's "internal thoughts" by translating GPT-called functions and their outcomes into natural language, accompanied by relevant icons. Additionally, after each GPT query cycle, the LLM is prompted to summarize the reasoning behind its actions.
Figure 4: The interaction flow. The blue squares are the actions generated by the LLM; the grey ones are rule-based functions.
Figure 5: Creating atomic animation clips.
