Generalizable Imitation Learning Through Pre-Trained Representations

Wei-Di Chang, Francois Hogan, Scott Fujimoto, David Meger, Gregory Dudek

TL;DR

DVK addresses generalization in imitation learning for object manipulation by grounding policies in dense, pre-trained DINO ViT patch embeddings that are distilled into semantic keypoints. The method clusters demonstration patch features to define reference concepts, then tracks these concepts as keypoints during policy learning via Behavior Cloning, resulting in compact, transferable inputs. A Grasping Generalization Benchmark based on Google Scanned Objects and Robosuite evaluates intra-class and inter-class transfer, where DVK consistently outperforms baselines and ablations highlight the importance of the keypoint representation. Overall, the work demonstrates that stable, part-semantic representations enable zero-shot adaptation to unseen objects and provides open-source resources to advance generalization research in imitation learning.

Abstract

In this paper, we leverage self-supervised vision transformer models and their emergent semantic abilities to improve the generalization abilities of imitation learning policies. We introduce DVK, an imitation learning algorithm that uses rich pre-trained Vision Transformer (ViT) patch-level embeddings to generalize better when learning from demonstrations. Our learner sees the world by clustering appearance features into groups associated with semantic concepts, forming stable keypoints that generalize across a wide range of appearance variations and object types. We demonstrate how this representation enables generalized behaviour by evaluating imitation learning across a diverse dataset of object manipulation tasks. To facilitate further study of generalization in Imitation Learning, all of our code for the method and evaluation, as well as the dataset, is made available.
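
The clustering-then-tracking idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`cosine`, `kmeans`, `keypoints`) are hypothetical, the toy 2-D vectors stand in for real DINO ViT patch embeddings, the deterministic centroid initialization is a simplification, and picking the single best-matching patch per reference feature (an argmax over cosine similarity) is an assumed way of turning similarities into keypoint locations.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means (Euclidean); returns k centroids.

    Assumption: centroids start from evenly spaced input points so the
    sketch is deterministic; real code would typically use k-means++.
    """
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            groups[nearest].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster empties out
                centroids[j] = [sum(col) / len(g) for col in zip(*g)]
    return centroids

def keypoints(patch_grid, references):
    """For each reference feature, return (row, col) of the most similar patch."""
    cells = [(r, c) for r, row in enumerate(patch_grid) for c in range(len(row))]
    return [
        max(cells, key=lambda rc: cosine(patch_grid[rc[0]][rc[1]], ref))
        for ref in references
    ]

# Toy stand-ins for patch embeddings pooled from demonstration images.
demo_patches = [[1.0, 0.1], [0.9, 0.0], [1.0, 0.0],
                [0.0, 1.0], [0.1, 0.9], [0.0, 1.1]]
reference_features = kmeans(demo_patches, k=2)

# Toy 2x3 grid of patch embeddings from a new, unseen image.
new_image = [[[0.5, 0.5], [2.0, 0.1], [0.3, 0.3]],
             [[0.0, 3.0], [1.0, 1.0], [0.2, 0.1]]]
print(keypoints(new_image, reference_features))
```

The policy input is then just one grid coordinate per reference concept, which is far more compact than the raw image. Cosine similarity ignores vector magnitude, so a patch can match a reference concept even when its embedding norm differs.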

Paper Structure (10 sections, 3 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: DINO ViT embeddings encode zero-shot fine-grained part-semantic information. Cosine similarity heatmaps between the DINO image patch embeddings of various household objects and a reference patch, all previously unseen. Shown are 6 of the 24 objects from the Google Scanned Objects dataset (Downs et al., 2022) used in our manipulation experiments to test the generalization abilities of IL policies.
  • Figure 2: Overview of our feature extraction pipeline. Our approach extracts reference DINO features through K-Means clustering that represent semantic concepts from the demonstration dataset. These reference features are used in downstream policy learning, to abstract images seen in rollouts into semantic visual keypoints.
  • Figure 3: Example grasping rollouts. Our keypoints (represented by colored circles) track semantic concepts extracted from the expert demonstrations. In the shown rollouts, we display keypoints that track parts of objects such as their handles, tips and edges, parts of the end-effector, and static elements of the workspace such as the corner of the table.
  • Figure 4: Intra-Class and Inter-Class Generalization Benchmark. Our grasping benchmark examines the intra-class generalization abilities of policies through four-fold cross-validation on the Mug class, and inter-class generalization through transfer from 3 objects to 21 unseen objects of various classes.
  • Figure 5: Demonstration collection setup using a spacemouse in Robosuite. Our benchmark consists of 60 demonstration trajectories collected through human teleoperation for each training object.
  • ...and 2 more figures
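
The evaluation protocol described in the captions for Figures 4 and 5 can be sketched as follows. This is an illustrative sketch only: the object identifiers are hypothetical placeholders rather than real Google Scanned Objects model names, the number of mugs (8) is an assumption because the captions do not state it, and `k_fold_splits` is a helper written for this example. The figures taken from the captions are the four folds, the 3 training and 21 unseen objects, and the 60 demonstrations per training object.

```python
def k_fold_splits(items, k):
    """Yield (train, test) lists so that each item is held out exactly once."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

DEMOS_PER_TRAINING_OBJECT = 60  # stated in the Figure 5 caption

# Intra-class: four-fold cross-validation over the Mug class.
mugs = [f"mug_{i}" for i in range(8)]  # hypothetical IDs; the count is assumed
intra_class = list(k_fold_splits(mugs, 4))

# Inter-class: train on 3 objects, evaluate on 21 unseen objects.
inter_train = [f"train_object_{i}" for i in range(3)]    # hypothetical IDs
inter_test = [f"unseen_object_{i}" for i in range(21)]   # hypothetical IDs
inter_train_demos = DEMOS_PER_TRAINING_OBJECT * len(inter_train)

for fold, (train, test) in enumerate(intra_class):
    print(f"fold {fold}: train on {len(train)} mugs, test on {test}")
print(f"inter-class: {inter_train_demos} demonstrations, "
      f"{len(inter_test)} unseen test objects")
```

Keeping the held-out objects strictly disjoint from the training objects is what makes the reported success rates a measure of zero-shot transfer rather than memorization.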