Goal Representations for Instruction Following: A Semi-Supervised Language Interface to Control

Vivek Myers, Andre He, Kuan Fang, Homer Walke, Philippe Hansen-Estruch, Ching-An Cheng, Mihai Jalobeanu, Andrey Kolobov, Anca Dragan, Sergey Levine

TL;DR

The paper tackles data scarcity in instruction-following robotics by introducing GRIF, which aligns language instructions with state transitions rather than static goals, enabling semi-supervised learning from large unlabeled datasets. It decouples policy learning from task representations and uses a contrastive loss to explicitly align language and transition encodings, while leveraging pre-trained vision-language models via CLIP adaptations. Empirical results in real-world tabletop manipulation show GRIF outperforms baselines and ablations, with improved grounding and generalization to unseen instructions. The approach reduces labeling requirements and suggests a practical pathway for scalable, language-driven robotic control.

Abstract

Our goal is for robots to follow natural language instructions like "put the towel next to the microwave." But getting large amounts of labeled data, i.e., data that contains demonstrations of tasks labeled with the language instruction, is prohibitive. In contrast, obtaining policies that respond to image goals is much easier, because any autonomous trial or demonstration can be labeled in hindsight with its final state as the goal. In this work, we contribute a method that taps into joint image- and goal-conditioned policies with language using only a small amount of language data. Prior work has made progress on this using vision-language models or by jointly training language-goal-conditioned policies, but so far neither method has scaled effectively to real-world robot tasks without significant human annotation. Our method achieves robust performance in the real world by learning an embedding from the labeled data that aligns language not to the goal image, but rather to the desired change between the start and goal images that the instruction corresponds to. We then train a policy on this embedding: the policy benefits from all the unlabeled data, but the aligned embedding provides an interface for language to steer the policy. We show instruction following across a variety of manipulation tasks in different scenes, with generalization to language instructions outside of the labeled data. Videos and code for our approach can be found on our website: https://rail-berkeley.github.io/grif/ .
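
The alignment idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a symmetric InfoNCE-style contrastive loss over a batch of matched (instruction, transition) embedding pairs, and the function names, the temperature value, and the use of plain NumPy arrays in place of learned encoders are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(lang_emb, trans_emb, temperature=0.1):
    """Hypothetical alignment loss between instruction and transition embeddings.

    lang_emb[i]  : embedding of instruction i
    trans_emb[i] : embedding of the (start image, goal image) pair for the same task
    Row i of each array is a positive pair; every other row in the batch is a negative.
    """
    z_lang = l2_normalize(lang_emb)
    z_trans = l2_normalize(trans_emb)
    logits = z_lang @ z_trans.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    # Instruction -> transition: each row should put its mass on the diagonal.
    loss_lang_to_trans = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Transition -> instruction: each column should put its mass on the diagonal.
    loss_trans_to_lang = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_lang_to_trans + loss_trans_to_lang)
```

Under these assumptions, a batch whose instruction and transition embeddings agree row by row yields a lower loss than the same batch with the pairing scrambled. Because the policy consumes only the resulting task representation, it can in principle be trained on unlabeled trajectories using the transition embedding and commanded with language at test time.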

Paper Structure (26 sections, 1 equation, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Left: Our approach learns representations of instructions that are aligned to transitions from the initial state to the goal. When commanded with instructions, the policy $\pi$ computes the task representation $z$ from the instruction and predicts the action $a$ to solve the task. Our approach is trained with a small number of labeled demonstrations and large-scale unlabeled demonstrations. Right: Our approach can solve diverse tasks and generalize to vast environment variations.
  • Figure 2: Left: We explicitly align representations between goal-conditioned and language-conditioned tasks on the labeled dataset $\DA$ through contrastive learning. Right: Given the pre-trained task representations, we train a policy on both labeled and unlabeled datasets.
  • Figure 3: Comparison of success rates $\pm \text{SE}$ between the top three methods across all trials within the three scenes. Two other baselines, LCBC and R3M (not shown), achieved 0.0 success on all evaluation tasks, although they do succeed on tasks that are heavily covered in the training data. Statistical significance is starred. The initial observation and instructions of each scene are shown.
  • Figure 4: Success rates of ablations with one standard error.
  • Figure 5: Left: Comparison of the top-5 text-to-image retrieval accuracy of representations learned by different ablations. Right: Examples of retrieved image pairs given instructions.
  • ...and 1 more figure
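
The top-5 text-to-image retrieval accuracy mentioned in the Figure 5 caption can be computed along the following lines. This is a minimal sketch under stated assumptions, not the paper's evaluation code: it assumes cosine similarity between instruction embeddings and image-pair embeddings, with row i of each array forming the correct match, and the function name and arguments are hypothetical.

```python
import numpy as np


def top_k_retrieval_accuracy(text_emb, pair_emb, k=5):
    """Fraction of instructions whose matching image pair ranks in the top k.

    text_emb[i] : embedding of instruction i
    pair_emb[i] : embedding of the (start, goal) image pair that matches instruction i
    """
    text = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    pairs = pair_emb / np.linalg.norm(pair_emb, axis=1, keepdims=True)
    sims = text @ pairs.T  # cosine similarity, shape (num_texts, num_pairs)
    # Indices of the k most similar image pairs for each instruction.
    top_k = np.argsort(-sims, axis=1)[:, :k]
    targets = np.arange(sims.shape[0])[:, None]
    # An instruction counts as a hit if its own index appears among its top k.
    return float((top_k == targets).any(axis=1).mean())
```

A higher value indicates that instructions land closer to their own transitions than to other transitions in the evaluation set, which is one way to quantify how well the two embedding spaces are aligned.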