Generalizable Imitation Learning Through Pre-Trained Representations
Wei-Di Chang, Francois Hogan, Scott Fujimoto, David Meger, Gregory Dudek
TL;DR
DVK addresses generalization in imitation learning for object manipulation by grounding policies in dense, pre-trained DINO ViT patch embeddings that are distilled into semantic keypoints. The method clusters patch features from demonstrations to define reference concepts, then tracks these concepts as keypoints during policy learning via behavior cloning, yielding compact, transferable policy inputs. A Grasping Generalization Benchmark based on Google Scanned Objects and Robosuite evaluates intra-class and inter-class transfer; DVK consistently outperforms the baselines, and ablations highlight the importance of the keypoint representation. Overall, the work demonstrates that stable, part-semantic representations enable zero-shot adaptation to unseen objects, and it provides open-source resources to advance generalization research in imitation learning.
Abstract
In this paper, we leverage self-supervised vision transformer models and their emergent semantic abilities to improve the generalization of imitation learning policies. We introduce DVK, an imitation learning algorithm that uses rich, pre-trained Vision Transformer patch-level embeddings to generalize better when learning from demonstrations. Our learner sees the world by clustering appearance features into groups associated with semantic concepts, forming stable keypoints that generalize across a wide range of appearance variations and object types. We demonstrate how this representation enables generalized behaviour by evaluating imitation learning across a diverse dataset of object manipulation tasks. To facilitate further study of generalization in imitation learning, all of our code for the method and evaluation, as well as the dataset, is made available.
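The cluster-then-track idea described above can be illustrated with a minimal sketch. This is not the released DVK implementation. It assumes that patch embeddings are already available as arrays (random placeholders stand in for ViT outputs here), that clustering is a simple cosine k-means, and that each keypoint is the grid location of the patch most similar to a cluster centroid. The function names `kmeans_cosine` and `extract_keypoints`, the feature dimension, and the number of clusters are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def kmeans_cosine(features, k, iters=20, seed=0):
    """Cluster patch features by cosine similarity (assumed clustering step).

    features: (N, D) array of patch embeddings pooled over demonstrations.
    Returns a (k, D) array of unit-norm centroids, one per "concept".
    """
    rng = np.random.default_rng(seed)
    feats = l2_normalize(features)
    # Initialize centroids from k distinct patches (fancy indexing copies).
    centroids = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        # Assign every patch to its most similar centroid.
        labels = np.argmax(feats @ centroids.T, axis=1)
        for j in range(k):
            members = feats[labels == j]
            if len(members) > 0:  # leave empty clusters unchanged
                centroids[j] = l2_normalize(members.mean(axis=0))
    return centroids


def extract_keypoints(patch_grid, centroids):
    """Locate each concept in a new image (assumed argmax matching).

    patch_grid: (H, W, D) patch embeddings of one image.
    Returns a (k, 2) array of (row, col) positions normalized to [0, 1].
    """
    h, w, d = patch_grid.shape
    feats = l2_normalize(patch_grid.reshape(h * w, d))
    similarity = centroids @ feats.T          # (k, H*W) cosine similarities
    best = np.argmax(similarity, axis=1)      # best-matching patch per concept
    rows, cols = np.unravel_index(best, (h, w))
    return np.stack([rows / (h - 1), cols / (w - 1)], axis=1)


# Placeholder data: 4 demonstration frames of 14x14 patches, 32-dim features.
rng = np.random.default_rng(0)
demo_patches = rng.normal(size=(4 * 14 * 14, 32))
centroids = kmeans_cosine(demo_patches, k=5)

# Keypoints for an unseen frame become a compact policy input vector.
new_image = rng.normal(size=(14, 14, 32))
keypoints = extract_keypoints(new_image, centroids)
policy_input = keypoints.flatten()
```

In this sketch the policy would receive only `policy_input`, a vector of 2k coordinates, rather than raw pixels or full feature maps, which is one way to read the "compact, transferable inputs" described in the TL;DR. A behavior cloning policy would then be fit by regressing demonstrated actions on these vectors.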
