Pragmatic Embodied Spoken Instruction Following in Human-Robot Collaboration with Theory of Mind

Lance Ying; Xinyi Li; Shivam Aarya; Yizirui Fang; Yifan Yin; Jason Xinyu Liu; Stefanie Tellex; Joshua B. Tenenbaum; Tianmin Shu

TL;DR

This work tackles robust instruction following in noisy human–robot collaboration by introducing SIFToM, a neurosymbolic framework that grounds multimodal inputs with a Vision-Language Model and then applies Theory of Mind-based probabilistic inference to recover the user's intent. It formulates the task as a two-agent POMDP and combines two likelihoods, an action likelihood and an instruction likelihood, to jointly infer the intended goal and plan. Empirical evaluation in both simulated (UnclearInstruct in VirtualHome) and real-world (Stretch robot in a kitchen) settings shows that SIFToM outperforms strong VLM baselines and approaches human-level accuracy, with notable gains in speed and reliability. The results highlight the value of pragmatic reasoning for robust and trustworthy embodied AI, while pointing to grounding fidelity as a key bottleneck and a focus for future work in more complex, real-world contexts.
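The joint inference summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a small discrete set of candidate goals, a uniform prior, and that the observed actions and the noisy utterance are conditionally independent given the goal. The function name `posterior_over_goals` and all probability values are hypothetical, chosen only to mirror the cereal example from the paper's Figure 2.

```python
def posterior_over_goals(prior, action_lik, instruction_lik):
    """Bayes rule over goals: P(g | actions, speech) is proportional to
    P(g) * P(actions | g) * P(speech | g).

    Assumes (for illustration) that actions and speech are conditionally
    independent given the goal g.
    """
    unnormalized = {
        g: prior[g] * action_lik[g] * instruction_lik[g] for g in prior
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("all goal hypotheses have zero probability")
    return {g: v / total for g, v in unnormalized.items()}


# Hypothetical numbers, not taken from the paper.
goals = ["cereal", "oatmeal", "sugar"]
prior = {g: 1.0 / len(goals) for g in goals}

# Fetching milk and a bowl is consistent with cereal or oatmeal.
action_lik = {"cereal": 0.45, "oatmeal": 0.45, "sugar": 0.10}

# The noisy transcript "silver" sounds closest to cereal or sugar.
instruction_lik = {"cereal": 0.45, "oatmeal": 0.10, "sugar": 0.45}

posterior = posterior_over_goals(prior, action_lik, instruction_lik)
best_goal = max(posterior, key=posterior.get)
print(best_goal, round(posterior[best_goal], 3))
```

Neither likelihood alone separates the candidates in this toy setup, but their product concentrates the posterior on the one goal that both sources support.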

Abstract

Spoken language instructions are ubiquitous in agent collaboration. However, in real-world human-robot collaboration, following human spoken instructions can be challenging due to various speaker and environmental factors, such as background noise or mispronunciation. When faced with noisy auditory inputs, humans can leverage the collaborative context in the embodied environment to interpret noisy spoken instructions and take pragmatic assistive actions. In this paper, we present a cognitively inspired neurosymbolic model, Spoken Instruction Following through Theory of Mind (SIFToM), which leverages a Vision-Language Model with model-based mental inference to enable robots to pragmatically follow human instructions under diverse speech conditions. We test SIFToM in both simulated environments (VirtualHome) and real-world human-robot collaborative settings with human evaluations. Results show that SIFToM can significantly improve the performance of a lightweight base VLM (Gemini 2.5 Flash), outperforming state-of-the-art VLMs (Gemini 2.5 Pro) and approaching human-level accuracy on challenging spoken instruction following tasks.

Paper Structure (33 sections, 3 equations, 7 figures, 1 table)

Introduction
Related Work
Vision Language Model for Human Robot Collaboration
Language Grounding in Embodied Interactions
Theory of Mind for Cooperative Robot Planning
Methods
Problem Formulation
Spoken Instruction Following through Theory of Mind (SIFToM)
Model Overview
Generating Symbolic Representations
Probabilistic Goal and Plan Inference
Action Likelihood
Instruction Likelihood
Simulated Experiment
Dataset Construction
...and 18 more sections

Figures (7)

Figure 1: Example scenario where the human is asking for the tomato to make a salad. State-of-the-art automatic speech recognition (ASR) models or Vision-Language Models often cannot decode whether the human said tomato or potato due to noise or mispronunciation. However, SIFToM can make a pragmatic inference that the human is making a salad and is likely asking for a tomato.
Figure 2: Illustration of the SIFToM architecture for robust instruction following under noise. The system processes raw visual and speech inputs using a Vision-Language Model (VLM) to generate a structured, symbolic representation of the scene, human actions, and transcribed speech. This symbolic data then feeds into a probabilistic inference module that reasons over multiple goal hypotheses to find the most likely human intent. In the example shown, a person retrieves milk and a bowl in the kitchen, then gives a verbal instruction to the robot assistant: "Can you pass me the cereal?". However, due to background noise, the VLM does not transcribe the speech accurately and outputs "silver" instead of "cereal". In this example, the model cannot infer the true command from the visual observation or the speech alone: the actions of getting milk and a bowl could suggest the need for cereal or oatmeal, while the speech alone makes sugar and cereal the likely candidates. By integrating the two likelihoods, however, the model is able to infer the true instruction, which is to get the cereal.
Figure 3: An illustration of the VirtualHome simulator where a main agent and an assistant collaborate on a household task.
Figure 4: Model accuracy and speedup performance. Overall, SIFToM achieved the best performance among all models and approached human performance. Error bars indicate 95% confidence intervals bootstrapped from 1000 samples.
Figure 5: Histograms showing the distribution of speedup metrics across models.
...and 2 more figures
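The Figure 4 caption reports 95% confidence intervals bootstrapped from 1000 samples. A minimal sketch of one common way to compute such an interval follows. It assumes that "1000 samples" means 1000 bootstrap resamples and that a percentile interval over the resampled means is used; the source states neither. The function `bootstrap_ci` and the per-trial outcomes are hypothetical.

```python
import random


def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lower_idx], means[upper_idx]


# Hypothetical per-trial correctness flags (1 = correct), not paper data.
outcomes = [1] * 14 + [0] * 6
accuracy = sum(outcomes) / len(outcomes)
ci_low, ci_high = bootstrap_ci(outcomes)
print(accuracy, ci_low, ci_high)
```

Fixing the seed makes the interval reproducible across runs, which helps when the same error bars need to be regenerated for a figure.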
