Natural Multimodal Fusion-Based Human-Robot Interaction: Application With Voice and Deictic Posture via Large Language Model

Yuzhi Lai; Shenghai Yuan; Youssef Nassar; Mingyu Fan; Atmaraaj Gopal; Arihiro Yorita; Naoyuki Kubota; Matthias Rätsch

TL;DR

This work tackles intuitive HRI for elderly care by fusing voice commands with deictic postures to ground object references. It introduces NMM-HRI, a parallel multimodal framework in which verbal and gestural cues are translated into action and object intents, respectively, and then into executable action sequences by an LLM (GPT-4) under safety constraints. Key components include open-vocabulary object detection with YOLO-World, deictic posture grounding, and LLM-guided, collision-checked task generation, which the authors report yield faster and more robust interactions. The approach is validated on a UR3e manipulator across diverse real-world scenarios, and the authors state that the code will be released as open source.

Abstract

Translating human intent into robot commands is crucial for the future of service robots in an aging society. Existing Human-Robot Interaction (HRI) systems relying on gestures or verbal commands are impractical for the elderly due to difficulties with complex syntax or sign language. To address the challenge, this paper introduces a multi-modal interaction framework that combines voice and deictic posture information to create a more natural HRI system. The visual cues are first processed by the object detection model to gain a global understanding of the environment, and then bounding boxes are estimated based on depth information. By using a large language model (LLM) with voice-to-text commands and temporally aligned selected bounding boxes, robot action sequences can be generated, while key control syntax constraints are applied to avoid potential LLM hallucination issues. The system is evaluated on real-world tasks with varying levels of complexity using a Universal Robots UR3e manipulator. Our method demonstrates significantly better performance in HRI in terms of accuracy and robustness. To benefit the research community and the general public, we will make our code and design open-source.
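The two steps named in the abstract, temporally aligning the transcribed command with the selected bounding boxes and constraining the LLM output to a fixed control syntax, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Word`, `Pointed`, `align`, `validate`, the deictic word list, the 1.5 s gap, and the action names in `ALLOWED_ACTIONS` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    t: float  # time in seconds at which the word was spoken


@dataclass
class Pointed:
    label: str
    t: float  # time in seconds at which the posture selected this box


# Assumed values for illustration; the paper's actual vocabulary may differ.
DEICTIC_WORDS = {"this", "that", "here", "there"}
ALLOWED_ACTIONS = {"pick", "place", "move_to"}  # hypothetical API names


def align(words, pointed, max_gap=1.5):
    """Replace each deictic word with the label pointed at nearest in time."""
    out = []
    for w in words:
        if w.text.lower() in DEICTIC_WORDS and pointed:
            nearest = min(pointed, key=lambda p: abs(p.t - w.t))
            if abs(nearest.t - w.t) <= max_gap:
                out.append(f"<{nearest.label}>")
                continue
        out.append(w.text)  # keep the word when nothing is close enough
    return " ".join(out)


def validate(sequence):
    """Reject any generated step whose action is outside the allowed set."""
    for step in sequence:
        name = step.get("action")
        if name not in ALLOWED_ACTIONS:
            raise ValueError(f"unsupported action: {name!r}")
    return sequence
```

In this sketch, `align` would turn "put this there" plus two pointing events into a grounded command such as "put <cup> <plate>" before it is sent to the LLM, and `validate` would be applied to the parsed LLM reply so that an unknown action is rejected before it can reach the manipulator.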

Paper Structure (23 sections, 2 equations, 10 figures)

INTRODUCTION
RELATED WORK
PROBLEM DEFINITION
Problem Formulation
Parallel Multimodal Command Sequence
Construction of Complex Command Sequence
METHODOLOGY
Speech-to-Text Conversion
Object Detection
Deictic Posture Detection
Action Sequences Generation and Execution
Human-Robot Interaction
EXPERIMENTAL SETUP
Perception and Manipulator Setup
Description of Experimental Scenarios
...and 8 more sections

Figures (10)

Figure 1: Proposed voice-posture fusion HRI method has superior efficiency and requires no memorization of key syntax, which is ideal for elderly and healthcare applications. (a) Depth camera, (b) Robot manipulator, (c) Robot operating space, (d) Visual feedback, (e) User space, (f) Objects for experiment.
Figure 2: System overview. $\mathcal{V}$ denotes the voice command and $\mathcal{B}$ the human posture. $\mathcal{M}$ maps verbal features to the action intention $I_{\mathbb{A}}$, and $\mathcal{P}$ maps the human posture and the environment observation $\mathcal{S}$ to the object intention $I_{\mathbb{O}}$. GPT-4 decodes the multimodal commands and generates the action sequences $\mathbb{A}$. Finally, the control APIs change the state $q$ of the end-effector.
Figure 3: Alternative ways of finding object reference.
Figure 4: Collision-free trajectory generation.
Figure 5: The prompt is segmented into three sections: basic API constraints, action definition, and example tasks.
...and 5 more figures
