NVP-HRI: Zero Shot Natural Voice and Posture-based Human-Robot Interaction via Large Language Model

Yuzhi Lai, Shenghai Yuan, Youssef Nassar, Mingyu Fan, Thomas Weber, Matthias Rätsch

TL;DR

NVP-HRI proposes a zero-shot, multi-modal HRI framework that fuses natural voice and deictic posture to interact with unknown objects. It integrates the Segment Anything Model (SAM) for zero-shot object representation with a constrained large language model (GPT-4-turbo) to generate collision-free action sequences, aided by a cross-check using swept-volume reasoning. The system reduces interaction time by up to 59.2% to 65.2% compared with gesture-, NLP-, and VLM-based baselines, and reports strong user acceptance, including >97% preference in wide-field surveys. The approach offers practical benefits for elderly and healthcare robotics: users no longer need to memorize gestures, and the robot can interact robustly with novel objects. Code and methodology are to be open-sourced.

Abstract

Effective Human-Robot Interaction (HRI) is crucial for future service robots in aging societies. Existing solutions are biased toward only well-trained objects, creating a gap when dealing with new objects. Currently, HRI systems using predefined gestures or language tokens for pretrained objects pose challenges for all individuals, especially elderly ones. These challenges include difficulties in recalling commands, memorizing hand gestures, and learning new names. This paper introduces NVP-HRI, an intuitive multi-modal HRI paradigm that combines voice commands and deictic posture. NVP-HRI utilizes the Segment Anything Model (SAM) to analyze visual cues and depth data, enabling precise structural object representation. Through a pre-trained SAM network, NVP-HRI allows interaction with new objects via zero-shot prediction, even without prior knowledge. NVP-HRI also integrates with a large language model (LLM) for multimodal commands, coordinating them with object selection and scene distribution in real time for collision-free trajectory solutions. We also regulate the action sequence with the essential control syntax to reduce LLM hallucination risks. The evaluation of diverse real-world tasks using a Universal Robot showcased up to 59.2% efficiency improvement over traditional gesture control, as illustrated in the video https://youtu.be/EbC7al2wiAc. Our code and design will be openly available at https://github.com/laiyuzhi/NVP-HRI.git.
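
To make the idea of regulating the action sequence with a control syntax more concrete, here is a minimal sketch of how an LLM reply could be checked against a fixed action vocabulary before anything reaches the robot. The action names (`move_to`, `grasp`, `release`), the one-action-per-line text format, and the `parse_action_sequence` helper are all assumptions made for illustration; the paper's actual syntax is not given in this summary.

```python
from dataclasses import dataclass

# Hypothetical action vocabulary (an assumption, not the paper's syntax):
# action name -> number of numeric arguments it requires.
ALLOWED_ACTIONS = {"move_to": 3, "grasp": 0, "release": 0}


@dataclass(frozen=True)
class Action:
    name: str
    args: tuple


def parse_action_sequence(text: str) -> list[Action]:
    """Parse lines such as 'move_to 0.1 0.2 0.3'.

    Raises ValueError for any line that falls outside the allowed syntax,
    so a hallucinated command is rejected instead of being executed.
    """
    actions = []
    for line_no, raw in enumerate(text.strip().splitlines(), start=1):
        parts = raw.split()
        if not parts:
            continue  # skip blank lines
        name, *tokens = parts
        if name not in ALLOWED_ACTIONS:
            raise ValueError(f"line {line_no}: unknown action {name!r}")
        if len(tokens) != ALLOWED_ACTIONS[name]:
            raise ValueError(f"line {line_no}: wrong argument count for {name!r}")
        # float() itself raises ValueError on non-numeric tokens.
        actions.append(Action(name, tuple(float(tok) for tok in tokens)))
    return actions


if __name__ == "__main__":
    reply = "move_to 0.1 0.2 0.3\ngrasp\nmove_to 0.4 0.0 0.3\nrelease"
    for action in parse_action_sequence(reply):
        print(action)
```

Rejecting the whole reply on the first invalid line is a deliberately conservative choice in this sketch: a partially valid sequence could leave the robot holding an object with no release step.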

Paper Structure

This paper contains 27 sections, 7 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: The proposed voice-posture fusion HRI method is more efficient at manipulating untrained objects and requires no memorization of key syntax, which makes it well suited to elderly and healthcare applications.
  • Figure 2: System Overview. $\mathcal{V}$ represents the verbal command, $\mathcal{B}$ represents the posture references, $\mathcal{L}$ is the mapping from vocal features to text, $\Gamma$ represents the 3D cluster of the scene objects, $\mathcal{M}$ is the mapping that yields the target intention, and $\mathcal{A}$ is the mapping that yields the action sequences.
  • Figure 3: Input and Output of SAM: Out of three objects, only two have correct semantic meanings. However, all of them are segmented correctly.
  • Figure 4: The prompt is divided into action constraints, trajectory constraints, and example tasks, followed by a cross-check. Cross-check results are fed back to the LLM, and if a collision is detected, a new trajectory is generated. Only sequences that pass the cross-check are executed by the robot, resembling a closed-loop control system as in classical control theory.
  • Figure 5: Typical gestures used in the gesture-based HRI system [hanggesture], with their respective verbal commands.
  • ...and 5 more figures
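
The closed loop described in the Figure 4 caption (propose a trajectory, cross-check it, feed any collision back, and execute only what passes) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the swept volume is approximated by a capsule of radius `tool_radius` around each straight segment, obstacles are spheres, and `stub_planner` is a hypothetical stand-in for the LLM call.

```python
import math


def segment_point_distance(a, b, p):
    """Shortest distance from point p to the segment a-b (3D tuples)."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    denom = sum(c * c for c in ab)
    # Clamp the projection parameter so the closest point stays on the segment.
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(ab[i] * ap[i] for i in range(3)) / denom))
    closest = [a[i] + t * ab[i] for i in range(3)]
    return math.dist(closest, p)


def first_collision(waypoints, obstacles, tool_radius):
    """Return the index of the first colliding segment, or None.

    Each obstacle is a (center, radius) sphere; the tool's swept volume is
    approximated (assumption) by a capsule around each segment.
    """
    for i in range(len(waypoints) - 1):
        for center, radius in obstacles:
            if segment_point_distance(waypoints[i], waypoints[i + 1], center) < radius + tool_radius:
                return i
    return None


def plan_with_cross_check(propose, obstacles, tool_radius=0.05, max_attempts=3):
    """Ask `propose` for waypoints until the cross-check passes or attempts run out."""
    feedback = None
    for _ in range(max_attempts):
        waypoints = propose(feedback)
        hit = first_collision(waypoints, obstacles, tool_radius)
        if hit is None:
            return waypoints  # only collision-free sequences are released
        feedback = f"segment {hit} collides; propose a different path"
    return None


def stub_planner(feedback):
    """Hypothetical stand-in for the LLM: detours upward once it gets feedback."""
    start, goal = (0.0, 0.0, 0.1), (1.0, 0.0, 0.1)
    if feedback is None:
        return [start, goal]
    return [start, (0.0, 0.0, 0.5), (1.0, 0.0, 0.5), goal]


if __name__ == "__main__":
    obstacles = [((0.5, 0.0, 0.1), 0.1)]
    print(plan_with_cross_check(stub_planner, obstacles))
```

Returning `None` after `max_attempts` keeps the loop bounded, so a planner that never produces a collision-free path results in no motion rather than an unchecked one.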