Using Both Demonstrations and Language Instructions to Efficiently Learn Robotic Tasks

Albert Yu, Raymond J. Mooney

TL;DR

This work addresses the ambiguity that arises when specifying robotic tasks via demonstrations or language alone and proposes DeL-TaCo, a bimodal task-conditioning framework that jointly uses demonstrations and language during both training and testing. By encoding tasks as embeddings from a demonstration encoder and a language encoder, and fusing these within a single multi-task imitation policy, DeL-TaCo achieves better generalization to novel objects and instructions while reducing the effort required from human teachers. The key contributions include a simple, end-to-end architecture with a CLIP-style contrastive objective for demonstrations, substantial experimental gains across hundreds of tasks in simulation, and insights into how language and demonstrations complement each other to resolve ambiguity. This approach has practical implications for deployable household robots, suggesting that multimodal task specification can meaningfully reduce teaching effort and improve robustness in diverse, real-world manipulation scenarios.
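To make the "CLIP-style contrastive objective" concrete, here is a minimal sketch of a symmetric contrastive loss between demonstration and language embeddings. It assumes that row i of each batch describes the same task. The function names, the temperature value, and the symmetric form are illustrative assumptions; this summary does not specify the exact loss the authors use or which encoder receives gradients.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Scale a vector to unit length so that dot products are cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_style_loss(demo_embs, lang_embs, temperature=0.1):
    """Symmetric cross-entropy over a demo-by-language similarity matrix.

    Assumption: demo_embs[i] and lang_embs[i] belong to the same task, so the
    matching pairs lie on the diagonal of the similarity matrix.
    """
    d = [normalize(v) for v in demo_embs]
    l = [normalize(v) for v in lang_embs]
    n = len(d)
    logits = [[dot(d[i], l[j]) / temperature for j in range(n)] for i in range(n)]
    # Demo-to-language direction: each demo should pick out its own instruction.
    loss_d2l = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Language-to-demo direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    loss_l2d = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_d2l + loss_l2d)

# Toy batch of two tasks with 2-D embeddings (hypothetical values).
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is small when each demonstration embedding points the same way as its paired instruction embedding, and large when the pairs are swapped, which is the behavior a contrastive objective of this kind is meant to encourage.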

Abstract

Demonstrations and natural language instructions are two common ways to specify and teach robots novel tasks. However, for many complex tasks, a demonstration or language instruction alone contains ambiguities, preventing tasks from being specified clearly. In such cases, a combination of both a demonstration and an instruction more concisely and effectively conveys the task to the robot than either modality alone. To instantiate this problem setting, we train a single multi-task policy on a few hundred challenging robotic pick-and-place tasks and propose DeL-TaCo (Joint Demo-Language Task Conditioning), a method for conditioning a robotic policy on task embeddings comprised of two components: a visual demonstration and a language instruction. By allowing these two modalities to mutually disambiguate and clarify each other during novel task specification, DeL-TaCo (1) substantially decreases the teacher effort needed to specify a new task and (2) achieves better generalization performance on novel objects and instructions over previous task-conditioning methods. To our knowledge, this is the first work to show that simultaneously conditioning a multi-task robotic manipulation policy on both demonstration and language embeddings improves sample efficiency and generalization over conditioning on either modality alone. See additional materials at https://deltaco-robot.github.io/

Paper Structure (39 sections, 5 equations, 10 figures, 9 tables, 3 algorithms)

Figures (10)

  • Figure 1: DeL-TaCo Overview. Unlike current multitask methods that condition on a single task specification modality, DeL-TaCo simultaneously conditions on both language and demonstrations during training and testing to resolve any ambiguities in either task specification modality, enabling better generalization to novel tasks and significantly reducing teacher effort for specifying new tasks.
  • Figure 2: Method Architecture. DeL-TaCo uses three main networks: the policy $\pi$, a demonstration encoder $f_{demo}$, and a language encoder $f_{lang}$. During both training and testing, the policy is conditioned on the demonstration and language embeddings for the task.
  • Figure 3: Sample train and test tasks, grouped by the object identifier types (underlined in each language instruction). All 6 container identifiers are seen in both training and testing.
  • Figure 4: Train-Test Object Split. Objects are shown in raster-scan task-index order, so the object in the second row from top, second column from left, is the "bongo drum bowl", which is associated with task index 9.
  • Figure 5: Train-Test Color Split.
  • ...and 5 more figures
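
The conditioning described in the Figure 2 caption can be sketched as follows: a policy receives the current observation together with both a demonstration embedding and a language embedding. Everything concrete here is an assumption made for illustration: the toy encoders, the vocabulary, the linear policy, and fusion by concatenation. The captions above name the three networks but do not state how the embeddings are fused or what the networks look like.

```python
from typing import List, Sequence

# Hypothetical toy vocabulary; a real language encoder would be a learned model.
VOCAB = ["put", "blue", "red", "object", "in", "tray", "bowl"]

def f_demo(frames: Sequence[Sequence[float]]) -> List[float]:
    """Toy stand-in for the demonstration encoder: mean of per-frame features."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def f_lang(instruction: str) -> List[float]:
    """Toy stand-in for the language encoder: bag-of-words counts over VOCAB."""
    words = instruction.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def policy(obs: Sequence[float],
           z_demo: Sequence[float],
           z_lang: Sequence[float],
           weights: Sequence[Sequence[float]]) -> List[float]:
    """Toy linear policy conditioned on both embeddings.

    Assumption: the observation and the two task embeddings are concatenated
    into one input vector; each row of `weights` produces one action dimension.
    """
    x = list(obs) + list(z_demo) + list(z_lang)
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

# Usage with hypothetical values: 3 observation features, 2-D frame features.
obs = [0.2, 0.4, 0.6]
z_demo = f_demo([[0.0, 1.0], [1.0, 1.0]])
z_lang = f_lang("put blue object in tray")
input_dim = len(obs) + len(z_demo) + len(z_lang)
weights = [[0.01 * (i + 1) * (r + 1) for i in range(input_dim)] for r in range(2)]
action = policy(obs, z_demo, z_lang, weights)
```

Because both embeddings enter the policy input, changing either the instruction or the demonstration changes the predicted action, which is the property that lets one modality disambiguate the other.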