TEACh: Task-driven Embodied Agents that Chat

Aishwarya Padmakumar, Jesse Thomason, Ayush Shrivastava, Patrick Lange, Anjali Narayan-Chen, Spandana Gella, Robinson Piramuthu, Gokhan Tur, Dilek Hakkani-Tur

TL;DR

TEACh introduces a large-scale dataset of human–human, dialogue-guided embodied interactions in AI2-THOR to study how natural language can ground perception and actions for household tasks. It presents an extensible Task Definition Language and three benchmarks—Execution from Dialogue History, Trajectory from Dialogue, and Two-Agent Task Completion—to evaluate Follower-only and two-agent systems. Baseline experiments using an adapted Episodic Transformer reveal strong gains over simple baselines for EDH, but highlight significant challenges posed by long-horizon dialogue grounding and two-agent coordination, with end-to-end success remaining difficult. The work provides a foundation for future few-shot generalization, improved grounding, and human-in-the-loop evaluation for conversationally guided embodied AI in realistic home settings.

Abstract

Robots operating in human spaces must be able to engage in natural language interaction with people, both understanding and executing instructions, and using conversation to resolve ambiguity and recover from mistakes. To study this, we introduce TEACh, a dataset of over 3,000 human–human, interactive dialogues to complete household tasks in simulation. A Commander with access to oracle information about a task communicates in natural language with a Follower. The Follower navigates through and interacts with the environment to complete tasks varying in complexity from "Make Coffee" to "Prepare Breakfast", asking questions and getting additional information from the Commander. We propose three benchmarks using TEACh to study embodied intelligence challenges, and we evaluate initial models' abilities in dialogue understanding, language grounding, and task execution.

Paper Structure

This paper contains 28 sections, 29 figures, 13 tables, 1 algorithm.

Figures (29)

  • Figure 1: The Commander has oracle task details (a), object locations (b), a map (c), and egocentric views from both agents. The Follower carries out the task and asks questions (d). The agents can only communicate via language.
  • Figure 2: To collect TEACh, the Commander knows the task to be completed and can query the simulator for object locations. Searched items are highlighted in green for the Commander; highlights blink to enable seeing the underlying true scene colors. The Commander has a top-down map of the scene, with the current camera position shown in red, the Follower position shown in blue, and the object search camera position shown in yellow. The Follower moves around in the environment and interacts with objects, such as placing a fork (middle). Target objects for each interaction action are highlighted.
  • Figure 3: An example task definition from the TEACh task definition language (left) and how it informs the initial simulator state and the Commander Progress Check action. The Commander can SearchObject with a string query (right) or object instance (center) returned by the Progress Check task status, yielding a camera view, segmentation mask, and location.
  • Figure 4: Two EDH instances are constructed from this real example from the TEACh data. The first instance input contains only dialogue actions. After inference on the first instance, the agent is evaluated based on whether it moved the potato, pot, and the items cleared out of the sink to their target destinations. In this example, the pot cannot fit into the sink. The second instance input has both dialogue and environment actions, and is evaluated at inference by whether the pot lands on the stove filled with water, and whether the potato is inside the pot and boiled.
  • Figure 5: Human success rate for different tasks during data collection. Note that TEACh benchmarks contain only successful dialogue sessions, so human performance here is better read as a measure of how difficult tasks were for annotators to complete given both coordination demands and simulator quirks.
  • ...and 24 more figures