SemanticScanpath: Combining Gaze and Speech for Situated Human-Robot Interaction Using LLMs
Elisabeth Menendez, Michael Gienger, Santiago Martínez, Carlos Balaguer, Anna Belardinelli
TL;DR
The paper addresses the challenge of grounding spoken language in a real physical scene for embodied robots. It introduces SemanticScanpath, a text-based representation that fuses speech transcripts with gaze history into a semantically structured prompt for an LLM-based agent. The approach integrates gaze-derived Areas-of-Interest dwell times with scene context and provides tools for querying the environment, reasoning, and acting. Across breakfast and drink tasks with multiple users and two setups, the method achieves higher disambiguation accuracy, particularly when the system can query the scene, and is demonstrated on a TIAGo++ robot.
Abstract
Large Language Models (LLMs) have substantially improved the conversational capabilities of social robots. Nevertheless, for intuitive and fluent human-robot interaction, robots should be able to ground the conversation by relating ambiguous or underspecified spoken utterances to the current physical situation and to the intents the user expresses nonverbally, for example through referential gaze. Here we propose a representation integrating speech and gaze that enables LLMs to achieve greater situated awareness and correctly resolve ambiguous requests. Our approach relies on a text-based semantic translation of the scanpath produced by the user along with the verbal requests, and demonstrates LLMs' capabilities to reason about gaze behavior, robustly ignoring spurious glances or irrelevant objects. We validate the system across multiple tasks and two scenarios, showing its generality and accuracy, and demonstrate its implementation on a robotic platform, closing the loop from request interpretation to execution.
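To make the idea of a text-based translation of the scanpath concrete, the following is a minimal sketch, not the authors' implementation. It assumes gaze has already been mapped to labeled Areas-of-Interest (AOIs), collapses consecutive samples on the same AOI into dwell times, and pairs the result with the speech transcript in a plain-text prompt. The names `GazeSample`, `scanpath_dwells`, `build_prompt`, the `min_dwell` threshold, and the prompt wording are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class GazeSample:
    t: float   # timestamp in seconds (assumed already synchronized with speech)
    aoi: str   # label of the Area-of-Interest the gaze falls on (assumed given)


def scanpath_dwells(samples, min_dwell=0.1):
    """Collapse consecutive samples on the same AOI into (aoi, seconds) pairs.

    Dwell time is approximated as last minus first timestamp of each run;
    runs shorter than min_dwell (a hypothetical threshold) are dropped.
    """
    dwells = []
    for aoi, run in groupby(samples, key=lambda s: s.aoi):
        run = list(run)
        duration = run[-1].t - run[0].t
        if duration >= min_dwell:
            dwells.append((aoi, round(duration, 2)))
    return dwells


def build_prompt(transcript, dwells):
    """Render the transcript and the ordered dwell list as plain text."""
    lines = [f'User said: "{transcript}"', "Gaze history (in order):"]
    lines += [f"- looked at {aoi} for {d:.2f} s" for aoi, d in dwells]
    return "\n".join(lines)


# Illustrative data only: a brief glance at the table between two longer dwells.
samples = [
    GazeSample(0.0, "cup"), GazeSample(0.2, "cup"), GazeSample(0.4, "cup"),
    GazeSample(0.5, "table"),
    GazeSample(0.6, "juice"), GazeSample(0.9, "juice"), GazeSample(1.2, "juice"),
]
prompt = build_prompt("Can you pass me that?", scanpath_dwells(samples))
print(prompt)
```

In this sketch the threshold does only coarse filtering of very short glances. The abstract attributes the robustness to spurious glances and irrelevant objects to the LLM's own reasoning over the gaze description, so a real system would not need to rely on such a filter.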
