A Modern System Recipe for Situated Embodied Human-Robot Conversation with Real-Time Multimodal LLMs and Tool-Calling
Dong Won Lee, Sarah Gillet, Louis-Philippe Morency, Cynthia Breazeal, Hae Won Park
TL;DR
This work addresses grounded, real-time situated interaction by pairing a real-time multimodal LM with a lightweight tool framework for attention and active perception. The system has two components: a streaming, real-time LM that acts as the dialogue manager, and a set of function-calling tools (look_at_me, look_at_object, look_around, look_for, use_vision). The tools are governed by a geometric binding layer that maps image targets to robot gaze in SE(3) coordinates, and a memory map supports off-camera grounding. The evaluation covers six home-style scenarios and four system variants, reporting both objective tool-decision correctness and subjective interaction quality. Results suggest that real-time LLMs with tool use are a promising direction for practical situated embodied conversation, while recall and fine-grained perception remain challenges. The findings underscore the practical impact of tool-mediated attention on situational awareness, the importance of precise tool definitions and prompts, and the need for scalable, robot-agnostic implementations for broader deployment.
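To make the tool framework and the binding step concrete, here is a minimal Python sketch. It is an illustration under stated assumptions and not the authors' implementation: the tool names come from the text, but the parameter shapes, the pinhole camera model, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the helper names `pixel_to_ray` and `ray_to_pan_tilt` are all hypothetical. It computes gaze angles relative to the camera frame only; a full SE(3) binding would also compose the ray with the camera's pose in the robot frame.

```python
import math

# Tool names are taken from the text; the parameter shapes are assumptions.
TOOLS = [
    {"name": "look_at_me", "parameters": {}},
    {"name": "look_at_object", "parameters": {"u": "number", "v": "number"}},
    {"name": "look_around", "parameters": {}},
    {"name": "look_for", "parameters": {"query": "string"}},
    {"name": "use_vision", "parameters": {"question": "string"}},
]


def pixel_to_ray(u, v, fx, fy, cx, cy):
    """Back-project a pixel to a unit ray in the camera frame.

    Assumes a pinhole camera with x to the right, y down, z forward.
    """
    x, y, z = (u - cx) / fx, (v - cy) / fy, 1.0
    norm = math.sqrt(x * x + y * y + z * z)
    return (x / norm, y / norm, z / norm)


def ray_to_pan_tilt(ray):
    """Convert a camera-frame ray to (pan, tilt) in radians.

    Positive pan turns toward the right of the image; positive tilt looks up.
    """
    x, y, z = ray
    pan = math.atan2(x, z)
    tilt = math.atan2(-y, math.hypot(x, z))
    return pan, tilt


if __name__ == "__main__":
    # Hypothetical intrinsics for a 640x480 camera.
    fx = fy = 500.0
    cx, cy = 320.0, 240.0
    pan, tilt = ray_to_pan_tilt(pixel_to_ray(420.0, 200.0, fx, fy, cx, cy))
    print(f"pan={math.degrees(pan):.1f} deg, tilt={math.degrees(tilt):.1f} deg")
```

In this sketch, a call such as look_at_object would pass the target's pixel coordinates through `pixel_to_ray` and `ray_to_pan_tilt` to obtain a relative head motion. Keeping the geometry outside the language model means the model only has to choose a tool and a target.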
Abstract
Situated embodied conversation requires robots to interleave real-time dialogue with active perception: deciding what to look at, when to look, and what to say under tight latency constraints. We present a minimal system recipe that pairs a real-time multimodal language model with a small set of tool interfaces for attention and active perception. We study six home-style scenarios that require frequent attention shifts and increasing perceptual scope. Across four system variants, we evaluate turn-level tool-decision correctness against human annotations and collect subjective ratings of interaction quality. Results indicate that combining real-time multimodal large language models with tool use for active perception is a promising direction for practical situated embodied conversation.
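One plausible reading of turn-level tool-decision correctness is an exact-match accuracy between the tool the system chose on each turn and the tool a human annotator marked as appropriate. The sketch below assumes that reading; the abstract does not specify the scoring rule, so the function name `tool_decision_accuracy`, the use of `None` for "no tool call", and the example labels are all hypothetical.

```python
from typing import Optional, Sequence


def tool_decision_accuracy(
    predicted: Sequence[Optional[str]],
    annotated: Sequence[Optional[str]],
) -> float:
    """Fraction of turns where the chosen tool matches the human annotation.

    Each entry is a tool name, or None when no tool call was made or expected.
    Exact-match scoring is an assumption, not the paper's stated metric.
    """
    if len(predicted) != len(annotated):
        raise ValueError("predicted and annotated must cover the same turns")
    if not annotated:
        return 0.0
    correct = sum(p == a for p, a in zip(predicted, annotated))
    return correct / len(annotated)


if __name__ == "__main__":
    # Made-up labels for illustration only; these are not results from the paper.
    system_calls = ["look_at_me", None, "look_for"]
    human_labels = ["look_at_me", None, "look_around"]
    print(tool_decision_accuracy(system_calls, human_labels))
```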
