Bidirectional Intent Communication: A Role for Large Foundation Models

Tim Schreiter, Rishi Hazra, Jens Rüppel, Andrey Rudenko

TL;DR

The paper addresses the need for user-centric human-robot interaction by proposing Bident, a multimodal, LLM-guided framework that fuses speech and gaze dynamics into planning and action. It describes a ROS2-based architecture comprising vision and audio inputs, an LLM-driven Reasoning Module, and an Action Module controlling a NAO robot, with loopback safeguards for robust interaction. The authors outline a two-stage evaluation strategy (simulation and real-robot user studies) and discuss future work to advance ARMoD deployments, improve gaze integration, and assess safety and privacy in industrial and healthcare settings. The work aims to enable seamless, context-aware HRI in shared human environments, moving beyond rigid task-centric automation toward bidirectional communication and collaboration.

Abstract

Integrating multimodal foundation models has significantly enhanced autonomous agents' language comprehension, perception, and planning capabilities. However, while existing works adopt a task-centric approach with minimal human interaction, applying these models to developing assistive user-centric robots that can interact and cooperate with humans remains underexplored. This paper introduces "Bident", a framework designed to integrate robots seamlessly into shared spaces with humans. Bident enhances the interactive experience by incorporating multimodal inputs like speech and user gaze dynamics. Furthermore, Bident supports verbal utterances and physical actions like gestures, making it versatile for bidirectional human-robot interactions. Potential applications include personalized education, where robots can adapt to individual learning styles and paces, and healthcare, where robots can offer personalized support, companionship, and everyday assistance in the home and workplace environments.

Paper Structure (7 sections, 1 figure)

This paper contains 7 sections, 1 figure.

Figures (1)

  • Figure 1: Bident framework for LLM-informed dynamic interactions. Integrating verbal utterances and gaze (head orientation in red, eye-gaze direction in green) allows an LLM to reason about the situation and generate action plans that respond appropriately to the user's input. Bident enables bidirectional communication by generating and refining plans through multimodal feedback (dotted arrow), supporting closed-loop planning in dynamic environments. Participants interact with a simulated NAO robot to test the module. Final deployment will be on a non-humanoid robot with an "Anthropomorphic Robotic Mock Driver" (ARMoD) (Schreiter et al., 2023).
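
The closed loop in Figure 1 (multimodal input, reasoning, action, feedback) can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: it uses no ROS2 or NAO API, and every name in it (Observation, Action, stub_planner, run_closed_loop) is hypothetical. The rule-based stub_planner only stands in for the LLM-driven Reasoning Module so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Observation:
    """One multimodal input (hypothetical structure, not from the paper)."""
    utterance: str               # transcribed user speech
    gaze_target: Optional[str]   # object the user looks at, or None if unknown


@dataclass
class Action:
    """One robot output: a verbal utterance or a physical gesture."""
    kind: str      # "say" or "gesture"
    payload: str


# A planner maps the current observation and the dialogue history to a plan.
Planner = Callable[[Observation, List[str]], List[Action]]


def stub_planner(obs: Observation, history: List[str]) -> List[Action]:
    """Rule-based placeholder for the LLM reasoning step (assumption)."""
    if obs.gaze_target is None:
        # Gaze gives no referent, so ask the user to clarify.
        return [Action("say", "Which object do you mean?")]
    return [
        Action("gesture", f"point_at:{obs.gaze_target}"),
        Action("say", f"You asked about the {obs.gaze_target}."),
    ]


def run_closed_loop(
    observations: Iterable[Observation],
    planner: Planner,
    execute: Callable[[Action], None],
) -> List[str]:
    """Plan and act for each observation, feeding results back as history."""
    history: List[str] = []
    for obs in observations:
        plan = planner(obs, history)
        for action in plan:
            execute(action)
        # Feedback: later planning steps can see what was said and done.
        history.append(f"user: {obs.utterance} (gaze: {obs.gaze_target})")
        history.extend(f"robot: {a.kind} {a.payload}" for a in plan)
    return history


# Usage: an ambiguous request first, then one resolved by gaze.
executed: List[Action] = []
log = run_closed_loop(
    [Observation("What is that?", None), Observation("That one.", "red cup")],
    stub_planner,
    executed.append,
)
for line in log:
    print(line)
```

Keeping the planner behind a plain callable means the stub could be replaced by a call to a language model without touching the loop, and the history list is the only channel through which earlier turns influence later plans.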