SemanticScanpath: Combining Gaze and Speech for Situated Human-Robot Interaction Using LLMs

Elisabeth Menendez, Michael Gienger, Santiago Martínez, Carlos Balaguer, Anna Belardinelli

TL;DR

The paper addresses the challenge of grounding spoken language in a real physical scene for embodied robots. It introduces SemanticScanpath, a text-based representation that fuses speech transcripts with gaze history into a semantically structured prompt for an LLM-based agent. The approach integrates gaze-derived Areas-of-Interest dwell times with scene context and provides tools for querying the environment, reasoning, and acting. Across breakfast and drink tasks with multiple users and two setups, the method achieves higher disambiguation accuracy, particularly when the system can query the scene, and is demonstrated on a TIAGo++ robot.

Abstract

Large Language Models (LLMs) have substantially improved the conversational capabilities of social robots. Nevertheless, for intuitive and fluent human-robot interaction, robots should be able to ground the conversation by relating ambiguous or underspecified spoken utterances to the current physical situation and to the intents expressed nonverbally by the user, for example by using referential gaze. Here we propose a representation integrating speech and gaze to enable LLMs to obtain higher situated awareness and correctly resolve ambiguous requests. Our approach relies on a text-based semantic translation of the scanpath produced by the user along with the verbal requests and demonstrates LLMs' capabilities to reason about gaze behavior, robustly ignoring spurious glances or irrelevant objects. We validate the system across multiple tasks and two scenarios, showing its generality and accuracy, and demonstrate its implementation on a robotic platform, closing the loop from request interpretation to execution.

Paper Structure

This paper contains 12 sections, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Top: Example interaction, where the user looks sequentially at the robot, the cereal box, and the bowl while speaking. Bottom: Our system architecture, leveraging SemanticScanpath, a representation combining speech and gaze history to help the LLM-based agent resolve underspecified user requests by reasoning about speech, scene context, and gaze behavior.
  • Figure 2: Visualization of gaze history and speech input over time, and the corresponding semantic scanpath. The top plot shows the gaze history, capturing fixation segments with durations, while the middle plot presents speech with word-level timestamps. The bottom part presents the corresponding semantic scanpath, which combines spoken utterances and gaze history.
  • Figure 3: Top row: Accuracy with respect to the ground truth inference in the breakfast (a) and drink (b) scenarios. When the LLM can poll the scene and assess which objects are there (speech + gaze + scene condition), it infers what the user is referring to more accurately than when it receives only utterances and scanpaths (speech + gaze condition). Bottom row: gaze distribution per task (T1-T3) and scenario (breakfast (c), drink (d)) across the relative categories of objects (speech + gaze + scene condition). In cases where the system accurately inferred the user intent, the gaze dwelled primarily on the target objects, but the system could to some extent tolerate misleading fixations on irrelevant or distractor objects. Note that T2 in both scenarios was always correctly resolved, thus only one distribution is presented.
  • Figure 4: Robot demonstration illustrating the LLM’s ability to disambiguate user requests by integrating gaze history and speech input. The user’s utterance (“Can you help me with this?”) is combined with gaze history to infer the intended task: pouring cereal from the box into the bowl. Left: Snapshots of key moments of the interaction and the robot’s execution. Right: Sequence of tool calls made by the LLM, including query tools, diagnostic tools (e.g., reasoning), expression tools (e.g., speak), and the final action tool pour_into used to command the robot.
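
The fusion described in the Figure 2 caption, fixation segments with durations aligned against word-level speech timestamps, can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the `Fixation` and `Word` types, the `semantic_scanpath` function, the midpoint alignment rule, the output line format, and all timestamps and object labels are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fixation:
    target: str   # label of the gazed object (hypothetical labels)
    start: float  # seconds
    end: float    # seconds


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def semantic_scanpath(fixations: List[Fixation], words: List[Word]) -> str:
    """Render one text line per fixation: words spoken during it, target, dwell time."""
    lines = []
    for fix in fixations:
        # Assumption: a word belongs to the fixation that contains its midpoint.
        spoken = [
            w.text for w in words
            if fix.start <= (w.start + w.end) / 2 < fix.end
        ]
        utterance = " ".join(spoken) if spoken else "(silence)"
        dwell = fix.end - fix.start
        lines.append(f'"{utterance}" | gaze: {fix.target} | dwell: {dwell:.2f}s')
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up timings loosely following the Figure 1 interaction
    # (robot, then cereal box, then bowl).
    fixations = [
        Fixation("robot", 0.0, 0.8),
        Fixation("cereal_box", 0.8, 1.6),
        Fixation("bowl", 1.6, 2.5),
    ]
    words = [
        Word("Can", 0.1, 0.3), Word("you", 0.3, 0.5), Word("help", 0.5, 0.8),
        Word("me", 0.9, 1.0), Word("with", 1.0, 1.2), Word("this", 1.7, 2.0),
    ]
    print(semantic_scanpath(fixations, words))
```

The resulting text could then be placed in the LLM prompt alongside the scene context. Note that in this sketch words whose midpoint falls outside every fixation are dropped; a fuller version would need a rule for such gaps.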