Temporal Representation Alignment: Successor Features Enable Emergent Compositionality in Robot Instruction Following

Vivek Myers, Bill Chunyuan Zheng, Anca Dragan, Kuan Fang, Sergey Levine

TL;DR

Temporal Representation Alignment (TRA) introduces a time-contrastive loss to learn representations of states, goals, and language that enable compositional task execution without explicit planning or reinforcement learning. The method learns three encoders $\phi$, $\psi$, and $\xi$ and optimizes a combined loss $\mathcal{L}_{TRA}$ that includes behavioral cloning and symmetric InfoNCE alignments across time and modalities via $\mathcal{L}_{NCE}$. Empirical results on real-world BridgeData and OGBench show substantial gains in compositional generalization, with improvements over BC and offline RL baselines by more than 40% across 13 tasks and scenes. These findings suggest that structured, temporally aligned representations can yield planning-like generalization with supervised learning, reducing the need for explicit task decomposition.
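
The symmetric InfoNCE alignment mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the dot-product similarity, the `temperature` parameter, and the function names are assumptions made for illustration, and the summary does not specify how the alignment terms are weighted against the behavioral cloning term. The inputs `x` and `y` stand for two batches of paired embeddings (for example, a current-state embedding and the embedding of a future state or instruction from the same trajectory), where row `i` of `x` is the positive match for row `i` of `y` and all other rows act as negatives.

```python
import math


def diag_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(x, y, temperature=1.0):
    """Symmetric InfoNCE between paired embedding batches x[i] <-> y[i].

    Assumption: similarity is a temperature-scaled dot product.
    """
    logits = [
        [sum(a * b for a, b in zip(xi, yj)) / temperature for yj in y]
        for xi in x
    ]
    logits_t = [list(col) for col in zip(*logits)]  # transpose: y -> x direction
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits_t))


# Hypothetical toy embeddings: aligned pairs should score a lower loss
# than the same embeddings paired in the wrong order.
x = [[1.0, 0.0], [0.0, 1.0]]
y_aligned = [[5.0, 0.0], [0.0, 5.0]]
y_shuffled = [[0.0, 5.0], [5.0, 0.0]]
loss_aligned = symmetric_info_nce(x, y_aligned)
loss_shuffled = symmetric_info_nce(x, y_shuffled)
```

Averaging the two directions is what makes the loss symmetric: each embedding in `x` must pick out its partner in `y`, and each embedding in `y` must pick out its partner in `x`.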

Abstract

Effective task representations should facilitate compositionality, such that after learning a variety of basic tasks, an agent can perform compound tasks consisting of multiple steps simply by composing the representations of the constituent steps together. While this is conceptually simple and appealing, it is not clear how to automatically learn representations that enable this sort of compositionality. We show that learning to associate the representations of current and future states with a temporal alignment loss can improve compositional generalization, even in the absence of any explicit subtask planning or reinforcement learning. We evaluate our approach across diverse robotic manipulation tasks as well as in simulation, showing substantial improvements for tasks specified with either language or goal images.

Paper Structure

This paper contains 42 sections, 2 theorems, 20 equations, 9 figures, 5 tables, 1 algorithm.

Key Result

theorem 1

Suppose $\mathcal{D}$ is distributed according to eq:dataset_distribution and $\mathcal{D}^{*}$ is distributed according to eq:dataset_distribution. When $\gamma > 1-1/H$ and $\alpha > 1$, for optimal features $\phi$ and $\psi$ under eq:overall_objective, we have

Figures (9)

  • Figure 1: We show our Temporal Representation Alignment (TRA) method performing a language task, "put all food items in the bowl." TRA adds a time-contrastive loss for learning task representations to use with a goal- and language-conditioned policy. While TRA can implicitly decompose the task into steps and execute them one by one, the behavioral cloning (BC) and offline RL (AWR) methods fail at this compositional task. The structured representations learned by TRA enable this compositional behavior without explicit planning or hierarchical structure.
  • Figure 2: Aggregated performance on compositional generalization tasks, consisting of instruction-following and goal-reaching tasks.
  • Figure 3: The tabletop manipulation setup used for the real-world evaluation (see walke2023bridgedata).
  • Figure 4: Example rollouts of a task with TRA and LCBC. While TRA is able to successfully compose the steps to complete the task, LCBC fails to ground the instruction correctly.
  • Figure 5: Aggregated success rate of using AWR as an additional policy learning metric over all 4 scenes.
  • ...and 4 more figures

Theorems & Definitions (3)

  • theorem 1
  • corollary 1
  • proof