When Robots Get Chatty: Grounding Multimodal Human-Robot Conversation and Collaboration

Philipp Allgeuer; Hassan Ali; Stefan Wermter

TL;DR

This work tackles grounding large language models within a physical robot to enable natural, socially adept human-robot interaction. It presents NICOL, a modular ROS-based platform where the LLM coordinates multiple perception modules (open-vocabulary object detection, pose estimation, gesture detection) and a suite of robotic skills, enabling actions to be embedded directly into spoken responses. Key contributions include a ViLD-based open-vocabulary detector, a pose-based gesture detector, and an inline action mechanism that interleaves speech and robot actions in a single response, demonstrated across qualitative interactions and a 'Guess My Object' case study. Across multiple LLM backends, the study shows robust grounding and emergent social-cognitive behaviors, suggesting practical potential for naturalistic, language-driven HRI without task-specific programming.
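The inline action mechanism mentioned above can be illustrated with a minimal sketch. It assumes a hypothetical markup of the form `[action_name(argument)]` embedded in the LLM's text response; the paper's actual syntax, the action name `point_at`, and the function `split_response` are illustrative assumptions, not the authors' implementation. The idea is only to show how one response might be split into ordered speech and action segments so that a robot could speak and act in sequence.

```python
import re
from typing import List, Tuple

# Hypothetical inline-action markup: [action_name(argument)].
# This syntax is an assumption for illustration; the paper's real format may differ.
ACTION_PATTERN = re.compile(r"\[(\w+)\(([^)]*)\)\]")


def split_response(response: str) -> List[Tuple[str, str]]:
    """Split a response into ordered ("speech", text) and ("action", call) segments."""
    segments: List[Tuple[str, str]] = []
    cursor = 0
    for match in ACTION_PATTERN.finditer(response):
        # Text before the action marker is spoken first.
        speech = response[cursor:match.start()].strip()
        if speech:
            segments.append(("speech", speech))
        name, arg = match.group(1), match.group(2).strip()
        segments.append(("action", f"{name}({arg})"))
        cursor = match.end()
    # Any remaining text after the last action marker is spoken last.
    tail = response[cursor:].strip()
    if tail:
        segments.append(("speech", tail))
    return segments


example = "Do you mean this one? [point_at(banana)] It looks ripe."
segments = split_response(example)
for kind, content in segments:
    print(kind, "->", content)
```

In a real system each segment would be dispatched to a speech-generation or robot-skill module in order; here the segments are only printed.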

Abstract

We investigate the use of Large Language Models (LLMs) to equip neural robotic agents with human-like social and cognitive competencies for open-ended human-robot conversation and collaboration. We introduce a modular and extensible methodology for grounding an LLM in the sensory perceptions and capabilities of a physical robot, and integrate multiple deep learning models throughout the architecture. The integrated models cover speech recognition, speech generation, open-vocabulary object detection, human pose estimation, and gesture detection, with the LLM serving as the central text-based coordinating unit. The qualitative and quantitative results demonstrate the potential of LLMs to provide emergent cognition and interactive, language-oriented control of robots in a natural and social manner.

Paper Structure (16 sections, 3 figures, 2 tables)

Introduction
Related Work
Approach
Chat Manager
Open-Vocabulary Object Detector
Chat Architecture Components
Speech Recognition
Speech Generation
Emotion Expression
Arm Manipulation and Gaze Control
Human Pose and Gesture Detection
Chat Quality and Competency
Qualitative Assessment
Chat Analysis
Case Study: Guess My Object
...and 1 more section

Figures (3)

Figure 1: The robot uses a grounded LLM to understand and correctly react to ambiguous conversation by the user. This can include, for example, actions such as looking and pointing at objects in order to clarify its or the user's intentions.
Figure 2: Overview of the proposed grounded chat architecture.
Figure 3: NICOL's emotions: Neutral, happiness, sadness, surprise, and anger.
