Temporal Representation Alignment: Successor Features Enable Emergent Compositionality in Robot Instruction Following
Vivek Myers, Bill Chunyuan Zheng, Anca Dragan, Kuan Fang, Sergey Levine
TL;DR
Temporal Representation Alignment (TRA) introduces a time-contrastive loss for learning representations of states, goals, and language that enable compositional task execution without explicit planning or reinforcement learning. The method learns three encoders, $\phi$, $\psi$, and $\xi$, by optimizing a combined loss $\mathcal{L}_{TRA}$ that includes a behavioral cloning (BC) term and symmetric InfoNCE alignment terms $\mathcal{L}_{NCE}$ across time and modalities. Empirical results on real-world BridgeData and on OGBench show substantial gains in compositional generalization, outperforming BC and offline RL baselines by more than 40% across 13 tasks and scenes. These findings suggest that structured, temporally aligned representations can yield planning-like generalization from supervised learning alone, reducing the need for explicit task decomposition.
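To make the symmetric InfoNCE term concrete, the following is a minimal NumPy sketch of a generic symmetric InfoNCE loss between two batches of paired embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function name `symmetric_info_nce`, the cosine normalization, and the `temperature` value are hypothetical choices, and this summary does not specify which outputs of $\phi$, $\psi$, and $\xi$ are paired or how the terms are weighted inside $\mathcal{L}_{TRA}$.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(a, b, temperature=0.1):
    """Generic symmetric InfoNCE between paired embeddings.

    a, b: arrays of shape (batch, dim); row i of `a` is the positive
    match for row i of `b`, and all other rows act as negatives.
    Normalization and temperature are assumptions made for this sketch.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(len(a))
    # Classify the matching b for each a (rows), and the matching a
    # for each b (columns); positives sit on the diagonal.
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)


# Random stand-ins for encoder outputs, e.g. embeddings of current
# states and of states reached later in the same trajectory.
rng = np.random.default_rng(0)
current = rng.normal(size=(8, 16))
future = current + 0.1 * rng.normal(size=(8, 16))
loss = symmetric_info_nce(current, future)
```

In a time-contrastive setting, the positives would be pairs drawn from the same trajectory (a state and a later state, or a goal and its language description), so minimizing this loss pulls temporally or semantically related embeddings together while pushing apart unrelated ones in the batch.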
Abstract
Effective task representations should facilitate compositionality, such that after learning a variety of basic tasks, an agent can perform compound tasks consisting of multiple steps simply by composing the representations of the constituent steps. While this is conceptually simple and appealing, it is not clear how to automatically learn representations that enable this sort of compositionality. We show that learning to associate the representations of current and future states with a temporal alignment loss can improve compositional generalization, even in the absence of any explicit subtask planning or reinforcement learning. We evaluate our approach across diverse real-world robotic manipulation tasks as well as in simulation, showing substantial improvements for tasks specified with either language or goal images.
