A Modern System Recipe for Situated Embodied Human-Robot Conversation with Real-Time Multimodal LLMs and Tool-Calling

Dong Won Lee; Sarah Gillet; Louis-Philippe Morency; Cynthia Breazeal; Hae Won Park

TL;DR

This work addresses the challenge of grounded, real-time situated interaction by pairing a real-time multimodal LM with a lightweight tool framework for attention and active perception. The system has two components: a streaming, real-time LM that acts as dialogue manager, and a set of function-calling tools (look_at_me, look_at_object, look_around, look_for, use_vision). The tools are governed by a geometric binding layer that maps image targets to robot gaze in SE(3) coordinates, and a memory map supports off-camera grounding. The work evaluates six home-style scenarios and four system variants, reporting both objective tool-decision correctness and subjective interaction quality. Results indicate that real-time LLMs with tool use are a promising direction for practical situated embodied conversation, while recall and fine-grained perception remain challenges. The findings underscore the practical impact of tool-mediated attention on situational awareness, the importance of precise tool definitions and prompts, and the need for scalable, robot-agnostic implementations for broader deployment.
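To make the idea of binding an image target to robot gaze concrete, the following is a minimal sketch, not the paper's binding layer. It assumes a pinhole camera with known horizontal and vertical fields of view and reduces the problem to pan and tilt offsets, whereas the summary describes a mapping in full SE(3) coordinates. The function name `pixel_to_gaze_offset` and all of its parameters are hypothetical.

```python
import math


def pixel_to_gaze_offset(u, v, width, height, hfov_deg, vfov_deg):
    """Return (pan, tilt) offsets in degrees that would center pixel (u, v).

    Assumption: an ideal pinhole camera whose optical axis passes through
    the image center, with no lens distortion.
    """
    # Focal lengths in pixels from the pinhole model: f = (size / 2) / tan(fov / 2).
    fx = (width / 2) / math.tan(math.radians(hfov_deg) / 2)
    fy = (height / 2) / math.tan(math.radians(vfov_deg) / 2)

    # Pixel displacement of the target from the image center.
    dx = u - width / 2
    dy = v - height / 2

    # Positive pan: target is right of center.
    pan = math.degrees(math.atan2(dx, fx))
    # Positive tilt: target is above center (image y grows downward).
    tilt = -math.degrees(math.atan2(dy, fy))
    return pan, tilt
```

A tool such as look_at_object could, under these assumptions, add the returned offsets to the robot's current head angles; a real system would also need the camera-to-head transform and calibrated intrinsics.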

Abstract

Situated embodied conversation requires robots to interleave real-time dialogue with active perception: deciding what to look at, when to look, and what to say under tight latency constraints. We present a minimal system recipe that pairs a real-time multimodal language model with a small set of tool interfaces for attention and active perception. We study six home-style scenarios that require frequent attention shifts and increasing perceptual scope. Across four system variants, we evaluate turn-level tool-decision correctness against human annotations and collect subjective ratings of interaction quality. Results indicate that combining real-time multimodal large language models with tool use for active perception is a promising direction for practical situated embodied conversation.
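Turn-level tool-decision correctness can be illustrated with a simple agreement rate between annotated and executed tool calls. This is a sketch under assumptions: it treats each turn as having exactly one expected decision (a tool name, or `None` for "no tool call") and scores exact matches. The paper's actual scoring rules may differ, and the function name `tool_decision_accuracy` is hypothetical.

```python
from typing import List, Optional


def tool_decision_accuracy(
    expected: List[Optional[str]], executed: List[Optional[str]]
) -> float:
    """Fraction of turns where the executed tool matches the annotation.

    Assumption: one decision per turn; None means no tool call was
    expected or made on that turn.
    """
    if len(expected) != len(executed):
        raise ValueError("expected and executed must cover the same turns")
    if not expected:
        return 0.0
    correct = sum(e == x for e, x in zip(expected, executed))
    return correct / len(expected)


# Hypothetical four-turn interaction; the second turn calls a tool
# where the annotators expected none.
expected = ["look_for", None, "look_at_object", "use_vision"]
executed = ["look_for", "look_at_me", "look_at_object", "use_vision"]
accuracy = tool_decision_accuracy(expected, executed)
```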

Paper Structure (31 sections, 6 equations, 3 figures, 4 tables, 4 algorithms)

Introduction
Related Work
Situated Human-Robot Conversations
Real-Time Multimodal Large Language Models (LLMs)
LLM Tool-Use and Function Calling
LLM integration for Human Robot Embodied Conversation
Methods
Real-time multimodal LM for streaming dialogue and visual grounding
Function calling for robot attention control
Experiments
Interaction Scenarios
System Ablations
Human Annotations: System Correctness Measures
Overall Interaction Quality Measures
Results
...and 16 more sections

Figures (3)

Figure 1: Overview of our real-time situated conversation system. Streaming egocentric vision and audio are processed by a real-time multimodal LM, which (i) generates spoken dialogue and (ii) issues low-latency function calls to external tools for attention and active perception (e.g., Look_at_Person, Look_at_Object, Look_Around, Look_For, Use_Vision). Tool outputs update the shared perceptual context and drive robot gaze.
Figure 2: look_around performs a sweep to acquire and store egocentric views with associated robot poses, forming a lightweight view-memory. Given a language query (e.g., “Where is a good place to place a lamp?”), look_for searches over stored views using concurrent VLM/object-detector calls and returns the best-matching image evidence along with the corresponding robot orientation/pose for action.
Figure 3: Example interaction illustrating turn-level “what should the robot do next?” tool-decision evaluation in an outfit-selection scenario. As the user’s dialogue evolves (finding jackets → trying one on → asking about a specific brown jacket), the real-time system interleaves spoken responses with function calls. Human annotations specify the expected next perception action at each turn, enabling direct comparison to the executed tool calls.
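The view-memory described in the Figure 2 caption can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the `capture` and `score` callables stand in for the robot camera and for the VLM or object-detector call, the stored pose is reduced to pan and tilt angles, and the names `View` and `ViewMemory` are hypothetical. Concurrency uses a standard-library thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, List, Optional


@dataclass
class View:
    image: Any        # placeholder for a captured frame
    pan_deg: float    # robot orientation when the frame was taken
    tilt_deg: float


@dataclass
class ViewMemory:
    views: List[View] = field(default_factory=list)

    def look_around(
        self,
        capture: Callable[[float, float], Any],
        pan_angles: Iterable[float],
        tilt_deg: float = 0.0,
    ) -> None:
        """Sweep over pan angles, storing each view with its pose."""
        for pan in pan_angles:
            self.views.append(View(capture(pan, tilt_deg), pan, tilt_deg))

    def look_for(
        self, query: str, score: Callable[[str, Any], float]
    ) -> Optional[View]:
        """Score all stored views concurrently and return the best match."""
        if not self.views:
            return None
        with ThreadPoolExecutor() as pool:
            # pool.map preserves input order, so scores align with self.views.
            scores = list(pool.map(lambda v: score(query, v.image), self.views))
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.views[best]
```

In use, the returned view carries the orientation the robot would turn toward, which is how an off-camera target could be grounded after a sweep.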
