When Robots Get Chatty: Grounding Multimodal Human-Robot Conversation and Collaboration

Philipp Allgeuer, Hassan Ali, Stefan Wermter

TL;DR

This work tackles grounding large language models within a physical robot to enable natural, socially adept human-robot interaction. It presents NICOL, a modular ROS-based platform where the LLM coordinates multiple perception modules (open-vocabulary object detection, pose estimation, gesture detection) and a suite of robotic skills, enabling actions to be embedded directly into spoken responses. Key contributions include a ViLD-based open-vocabulary detector, a pose-based gesture detector, and an inline action mechanism that interleaves speech and robot actions in a single response, demonstrated across qualitative interactions and a 'Guess My Object' case study. Across multiple LLM backends, the study shows robust grounding and emergent social-cognitive behaviors, suggesting practical potential for naturalistic, language-driven HRI without task-specific programming.
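The inline action mechanism can be pictured as a parsing step that splits one LLM reply into an ordered sequence of speech and action segments. The following is a minimal sketch of that idea, not the paper's implementation: the bracketed marker syntax `[action_name(args)]`, the action names `look_at` and `point_at`, and the function `split_response` are all hypothetical and chosen only for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical marker syntax: [action_name(arg1, arg2)].
# This format is an assumption for illustration, not taken from the paper.
ACTION_PATTERN = re.compile(r"\[(\w+)\(([^)]*)\)\]")

# Each segment is (kind, text_or_action_name, args).
Segment = Tuple[str, str, Tuple[str, ...]]


def split_response(response: str) -> List[Segment]:
    """Split an LLM reply into ordered speech and action segments."""
    segments: List[Segment] = []
    cursor = 0
    for match in ACTION_PATTERN.finditer(response):
        # Text between the previous marker and this one is spoken aloud.
        speech = response[cursor:match.start()].strip()
        if speech:
            segments.append(("speech", speech, ()))
        # Split the comma-separated arguments, dropping empty entries.
        args = tuple(a.strip() for a in match.group(2).split(",") if a.strip())
        segments.append(("action", match.group(1), args))
        cursor = match.end()
    # Any text after the last marker is also speech.
    tail = response[cursor:].strip()
    if tail:
        segments.append(("speech", tail, ()))
    return segments


reply = "Sure, I can see it. [look_at(banana)] Do you mean this one? [point_at(banana)]"
for kind, value, args in split_response(reply):
    print(kind, value, args)
```

Keeping the segments in their original order is the point of the sketch: a dispatcher could then send speech segments to a speech generator and action segments to the corresponding robot skill in the sequence the LLM wrote them.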

Abstract

We investigate the use of Large Language Models (LLMs) to equip neural robotic agents with human-like social and cognitive competencies, for the purpose of open-ended human-robot conversation and collaboration. We introduce a modular and extensible methodology for grounding an LLM with the sensory perceptions and capabilities of a physical robot, and integrate multiple deep learning models throughout the architecture in a form of system integration. The integrated models encompass various functions such as speech recognition, speech generation, open-vocabulary object detection, human pose estimation, and gesture detection, with the LLM serving as the central text-based coordinating unit. The qualitative and quantitative results demonstrate the huge potential of LLMs in providing emergent cognition and interactive language-oriented control of robots in a natural and social manner.

Paper Structure

This paper contains 16 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: The robot uses a grounded LLM to understand and correctly react to ambiguous conversation by the user. This can include, for example, actions such as looking and pointing at objects in order to clarify its or the user's intentions.
  • Figure 2: Overview of the proposed grounded chat architecture.
  • Figure 3: NICOL's emotions: Neutral, happiness, sadness, surprise, and anger.