
Semi-Supervised One-Shot Imitation Learning

Philipp Wu, Kourosh Hakhamaneshi, Yuqing Du, Igor Mordatch, Aravind Rajeswaran, Pieter Abbeel

TL;DR

This work addresses the data-efficiency bottleneck of One-Shot Imitation Learning (OSIL) by introducing semi-supervised OSIL, which leverages a large unlabeled trajectory dataset alongside a small labeled, task-paired set. The authors propose a teacher-student framework where a teacher encoder is trained on labeled data to structure a latent trajectory space, enabling pseudo-labeling of unlabeled trajectories via $k$-nearest neighbors in embedding space. A student OSIL policy is then trained on both real and pseudo-labeled data, with iterative relabeling to progressively improve labels and policy performance. Experiments on semantic and sequential navigation tasks show that the semi-supervised approach can match or closely approach fully supervised OSIL performance with a fraction of the labeled data, while maintaining high trajectory retrieval quality, highlighting substantial improvements in label efficiency for OSIL.
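The pseudo-labeling step described above can be sketched as a nearest-neighbor lookup in embedding space. This is a minimal illustration, not the authors' implementation: the function name `knn_pseudo_pairs`, the use of Euclidean distance, and the toy Gaussian "embeddings" standing in for a trained teacher encoder's outputs are all assumptions.

```python
import numpy as np


def knn_pseudo_pairs(embeddings: np.ndarray, k: int) -> np.ndarray:
    """For each trajectory embedding, return the indices of its k nearest
    neighbors (Euclidean distance), excluding the trajectory itself.

    Hypothetical helper: the distance metric is an assumption.
    """
    n = embeddings.shape[0]
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of trajectories")
    # Pairwise squared Euclidean distances, shape (n, n).
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    # Exclude self-matches by placing each point infinitely far from itself.
    np.fill_diagonal(dists, np.inf)
    # Indices of the k smallest distances per row, nearest first.
    return np.argsort(dists, axis=1)[:, :k]


# Toy data (assumption): two well-separated clusters standing in for the
# embeddings of trajectories from two different tasks.
rng = np.random.default_rng(0)
task_a = rng.normal(loc=0.0, scale=0.1, size=(5, 8))
task_b = rng.normal(loc=5.0, scale=0.1, size=(5, 8))
embeddings = np.concatenate([task_a, task_b], axis=0)

# Each row lists the trajectories that would be pseudo-paired with row i.
neighbors = knn_pseudo_pairs(embeddings, k=3)
```

If the embedding space separates tasks well, each trajectory's neighbors come from the same task, so the retrieved pairs can be used like ground-truth pairings when training the student. In the iterative variant, the same lookup would be rerun with embeddings from the newly trained encoder.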

Abstract

One-shot Imitation Learning (OSIL) aims to imbue AI agents with the ability to learn a new task from a single demonstration. To supervise the learning, OSIL typically requires a prohibitively large number of paired expert demonstrations, i.e., trajectories corresponding to different variations of the same semantic task. To overcome this limitation, we introduce the semi-supervised OSIL problem setting, where the learning agent is presented with a large dataset of trajectories with no task labels (i.e., an unpaired dataset), along with a small dataset of multiple demonstrations per semantic task (i.e., a paired dataset). This presents a more realistic and practical embodiment of few-shot learning and requires the agent to effectively leverage weak supervision from a large dataset of trajectories. Subsequently, we develop an algorithm specifically applicable to this semi-supervised OSIL setting. Our approach first learns an embedding space where different tasks cluster uniquely. We utilize this embedding space and the clustering it supports to self-generate pairings between trajectories in the large unpaired dataset. Through empirical results on simulated control tasks, we demonstrate that OSIL models trained on such self-generated pairings are competitive with OSIL models trained with ground-truth labels, presenting a major advancement in the label-efficiency of OSIL.

Paper Structure (26 sections, 4 equations, 5 figures, 3 tables)


Figures (5)

  • Figure 1: (Left) Depiction of the supervised (classical) OSIL setting, where the encoder and policy are trained using several trajectories ($d$) sharing the same task label ($\bm{t}$). (Right) Our semi-supervised OSIL setting instead requires only a large unlabeled dataset of trajectories and a small paired dataset. For our method, a teacher trajectory encoder is first trained using the labeled dataset. This encoder is then used to construct a pseudo-paired trajectory set by retrieving the $k$ nearest neighbors of each trajectory. We can then train a student on this pseudo-labeled dataset, as in supervised OSIL. Optionally, this relabeling and training procedure can be repeated iteratively.
  • Figure 2: The architecture used in our algorithm. (a) shows the generic structure of an OSIL agent, which consists of a generic demonstration encoder $f_\phi$ and the task-latent-conditioned policy $\pi_\theta(a_t|s_t, z)$, which includes an image encoder. (b) shows one potential instantiation of the demonstration encoder, which leverages a bidirectional transformer to encode the trajectory. This is used for the pinpad sequential navigation task, which requires reasoning over the entire trajectory.
  • Figure 3: Sample goals and corresponding demonstration visualizations for the two tasks.
  • Figure 4: Task success rates for the Semantic Goal Navigation Task.
  • Figure 5: t-SNE visualizations of the learned embeddings when only 15% of the dataset is labeled. (a) shows the embedding trained with the imitation loss only. (b) adds the contrastive loss on the labeled subset of data. (c) additionally adds a self-supervised loss with image augmentations.
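
The Figure 5 caption refers to a contrastive loss on the labeled subset but does not give its form. As a rough illustration only, the sketch below uses a generic supervised contrastive (InfoNCE-style) objective, which is an assumption and may differ from the loss used in the paper. The function name, the temperature value, and the toy embeddings are all hypothetical.

```python
import numpy as np


def supervised_contrastive_loss(z, labels, temperature=0.1):
    """Generic supervised contrastive loss (an assumed stand-in, not
    necessarily the paper's loss): pull together embeddings that share a
    task label and push apart those that do not."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-normalize rows
    n = z.shape[0]
    self_mask = np.eye(n, dtype=bool)
    # Scaled cosine similarities; a sample is never compared with itself.
    sim = np.where(self_mask, -np.inf, z @ z.T / temperature)
    # Row-wise log-softmax over all other samples (numerically stabilized).
    sim_max = np.max(sim, axis=1, keepdims=True)
    log_norm = np.log(np.sum(np.exp(sim - sim_max), axis=1, keepdims=True))
    log_prob = sim - sim_max - log_norm
    # Positives: other samples carrying the same task label.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0  # skip anchors that have no positive
    if not valid.any():
        return 0.0
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_count[valid]).mean())


# Toy embeddings (assumption): two tight clusters along different axes.
z = np.array([[1.0, 0.0], [0.99, 0.05], [0.98, -0.05],
              [0.0, 1.0], [0.05, 0.99], [-0.05, 0.98]])
matching = np.array([0, 0, 0, 1, 1, 1])     # labels agree with clusters
mismatched = np.array([0, 1, 0, 1, 0, 1])   # labels cut across clusters

loss_matching = supervised_contrastive_loss(z, matching)
loss_mismatched = supervised_contrastive_loss(z, mismatched)
```

The loss is lower when the labels agree with the cluster structure, which is the property that would encourage same-task trajectories to group together in the embedding space.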