Situated Instruction Following

So Yeon Min; Xavi Puig; Devendra Singh Chaplot; Tsung-Yen Yang; Akshara Rai; Priyam Parashar; Ruslan Salakhutdinov; Yonatan Bisk; Roozbeh Mottaghi

TL;DR

Situated Instruction Following (SIF) introduces a Habitat 3.0–based benchmark to evaluate how agents interpret and act on language embedded in real-world, dynamic contexts. By separating exploration and task phases and defining static, object-movement, and human-movement tasks, the dataset probes ambiguity, evolving intent, and dynamic interpretation. Two EIF-style baselines, Reasoner and Prompter, reveal that current approaches struggle to consistently ground language in changing environments and human actions, with perception and segmentation emerging as key bottlenecks. The work highlights the need for holistic, situated reasoning beyond traditional instruction-following pipelines, offering a platform to drive progress in robust, context-aware embodied agents.

Abstract

Language is never spoken in a vacuum. It is expressed, comprehended, and contextualized within the holistic backdrop of the speaker's history, actions, and environment. Since humans are used to communicating efficiently with situated language, the practicality of robotic assistants hinges on their ability to understand and act upon implicit and situated instructions. In traditional instruction following paradigms, the agent acts alone in an empty house, leading to language use that is both simplified and artificially "complete." In contrast, we propose situated instruction following, which embraces the inherent underspecification and ambiguity of real-world communication with the physical presence of a human speaker. The meaning of situated instructions naturally unfolds through the past actions and the expected future behaviors of the human involved. Specifically, within our settings we have instructions that (1) are ambiguously specified, (2) have temporally evolving intent, and (3) can be interpreted more precisely through the agent's dynamic actions. Our experiments indicate that state-of-the-art Embodied Instruction Following (EIF) models lack a holistic understanding of situated human intention.

Paper Structure (34 sections, 1 equation, 3 figures, 13 tables)

Introduction
Related Work
Embodied Instruction Following (EIF)
Text-only agents
Dataset
Tasks
Three dimensions of situated reasoning
Two axes of difficulty
Dataset Construction
Baselines
Reasoner
Semantic Mapper
Text representation generator
Execution Tools
Prompter
...and 19 more sections

Figures (3)

Figure 1: Situated Instruction Following. The tasks in SIF consist of two phases: an exploration phase (phase 1) and a task phase (phase 2). PnP represents a conventional static Pick-and-Place task used for comparison, wherein the environment remains unchanged after the exploration phase. S$_{hum}$ and S$_{obj}$ introduce two novel types of situated instruction following tasks. In these tasks, the objects and human subjects move during the task phase. Nuanced communication regarding these movements is provided, necessitating reasoning about ambiguous, temporally evolving, and dynamic human intent.
Figure 2: Reasoner: (a) The semantic mapper is updated at every timestep, whereas the prompt generator and planner are activated either upon completion of the last high-level action or when a new decision is required. (b) The prompt consists of a system prompt, an environment prompt, and a format prompt.
Figure 3: Text Prompt Generation of Human Trajectory: The white regions in the maps are the regions that the human might walk towards; rooms with more than half of their area inside the white region are included in the text prompt. The red triangle marks the agent's position and direction; the green star and the green dot mark, respectively, the currently observed human position and the anticipated human position in 10 steps. The text prompt is given to Reasoner every 20 timesteps (and to Prompter, which is open-loop, at timestep 0) so that the model can decide whether there is enough evidence that the human's intent is clear.
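
The room-selection rule in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the grid-cell representation, the function names `rooms_for_prompt`, `should_prompt`, and `human_trajectory_prompt`, and the wording of the generated sentence are all assumptions made for illustration. Only the "more than half of the area" threshold and the 20-timestep interval come from the caption.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]  # (row, col) index into a top-down grid map (assumed representation)


def rooms_for_prompt(room_cells: Dict[str, Set[Cell]], reachable: Set[Cell]) -> List[str]:
    """Return the rooms whose area lies more than half inside the reachable region.

    `reachable` plays the role of the white region in Figure 3.
    """
    selected = []
    for name, cells in room_cells.items():
        if not cells:
            continue  # skip rooms with no mapped area
        overlap = len(cells & reachable)
        if overlap * 2 > len(cells):  # strictly more than half of the room's area
            selected.append(name)
    return sorted(selected)


def should_prompt(timestep: int, interval: int = 20) -> bool:
    """True at timestep 0 and then at every `interval` timesteps."""
    return timestep % interval == 0


def human_trajectory_prompt(rooms: List[str]) -> str:
    """Build a hypothetical text description; the wording is an assumption."""
    if not rooms:
        return "The human's destination is unclear."
    return "The human may be walking towards: " + ", ".join(rooms) + "."


# Toy map: three rooms on a small grid, with a partially overlapping reachable region.
rooms = {
    "kitchen": {(0, 0), (0, 1), (1, 0), (1, 1)},
    "bedroom": {(0, 2), (0, 3), (1, 2), (1, 3)},
    "hallway": {(2, 0), (2, 1)},
}
reachable = {(0, 0), (0, 1), (1, 0), (0, 2), (0, 3), (2, 0)}

if should_prompt(timestep=20):
    print(human_trajectory_prompt(rooms_for_prompt(rooms, reachable)))
```

In this toy map only the kitchen is selected: three of its four cells are reachable, while the bedroom and hallway sit at exactly half and therefore fail the strict threshold.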
