DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control

Zichen Jeff Cui, Hengkai Pan, Aadhithya Iyer, Siddhant Haldar, Lerrel Pinto

TL;DR

DynaMo is a new in-domain, self-supervised method for learning visual representations that significantly improve downstream imitation learning performance over prior self-supervised learning objectives and pretrained representations.

Abstract

Imitation learning has proven to be a powerful tool for training complex visuomotor policies. However, current methods often require hundreds to thousands of expert demonstrations to handle high-dimensional visual observations. A key reason for this poor data efficiency is that visual representations are predominantly either pretrained on out-of-domain data or trained directly through a behavior cloning objective. In this work, we present DynaMo, a new in-domain, self-supervised method for learning visual representations. Given a set of expert demonstrations, we jointly learn a latent inverse dynamics model and a forward dynamics model over a sequence of image embeddings, predicting the next frame in latent space, without augmentations, contrastive sampling, or access to ground truth actions. Importantly, DynaMo does not require any out-of-domain data such as Internet datasets or cross-embodied datasets. On a suite of six simulated and real environments, we show that representations learned with DynaMo significantly improve downstream imitation learning performance over prior self-supervised learning objectives and pretrained representations. Gains from using DynaMo hold across policy classes such as Behavior Transformer, Diffusion Policy, MLP, and nearest neighbors. Finally, we ablate over key components of DynaMo and measure their impact on downstream policy performance. Robot videos are best viewed at https://dynamo-ssl.github.io
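
The joint objective described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single linear layers, the dimensions, the helper names (`linear`, `dynamics_loss`), and the cosine-distance loss are all assumptions chosen for brevity, and only the forward pass is shown (no training loop). It shows how an encoder, a latent inverse dynamics model, and a forward dynamics model could be chained so that the only supervision signal is the next frame's embedding, with no ground truth actions involved.

```python
import numpy as np

rng = np.random.default_rng(0)


def linear(in_dim, out_dim):
    """Hypothetical helper: random weights standing in for a learned layer."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim))


def dynamics_loss(frames, w_enc, w_inv, w_fwd):
    """Next-embedding prediction loss for one demonstration trajectory.

    frames: array of shape (T, D), one flattened observation per time step.
    The loss form (1 - cosine similarity) is an assumption of this sketch.
    """
    z = frames @ w_enc                    # (T, E) embeddings of every frame
    z_t, z_next = z[:-1], z[1:]           # consecutive embedding pairs

    # Inverse dynamics: infer a latent "action" from (z_t, z_{t+1}).
    latent_action = np.concatenate([z_t, z_next], axis=1) @ w_inv   # (T-1, A)

    # Forward dynamics: predict z_{t+1} from z_t and the latent action.
    z_pred = np.concatenate([z_t, latent_action], axis=1) @ w_fwd   # (T-1, E)

    # Compare predicted and actual next embeddings in latent space.
    num = np.sum(z_pred * z_next, axis=1)
    den = np.linalg.norm(z_pred, axis=1) * np.linalg.norm(z_next, axis=1) + 1e-8
    return float(np.mean(1.0 - num / den))


# Illustrative sizes only: 8 frames, 32-dim observations,
# 16-dim embeddings, 4-dim latent actions.
T, D, E, A = 8, 32, 16, 4
frames = rng.normal(size=(T, D))
w_enc, w_inv, w_fwd = linear(D, E), linear(2 * E, A), linear(E + A, E)
loss = dynamics_loss(frames, w_enc, w_inv, w_fwd)
print(f"forward-prediction loss: {loss:.4f}")
```

In this sketch the latent action is narrower than the embedding (`A < E`), which stops the forward model from simply copying the next embedding through the inverse model's output. An actual training setup would minimize this loss with respect to all three sets of weights and would typically need additional measures against representation collapse, which the sketch omits.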

Paper Structure (35 sections, 1 equation, 7 figures, 18 tables)

Figures (7)

  • Figure 1: (a) We present DynaMo, a new self-supervised method for learning visual representations for visuomotor control. DynaMo exploits the causal structure in demonstrations by jointly learning the encoder with inverse and forward dynamics models. DynaMo requires no augmentations, contrastive sampling, or access to ground truth actions. This enables downstream policy learning using limited in-domain data across simulated and real-world robotics tasks. For each environment, we pretrain the visual representation in-domain with DynaMo and learn a policy on the pretrained embeddings. (b) We provide real-world rollouts of policies learned with DynaMo representations on our multi-task xArm Kitchen and Allegro Manipulation environments.
  • Figure 2: Embedding nearest neighbor matches for DynaMo, BYOL, MoCo, and TCN on the Block Pushing environment. (Top) The nearest neighbor matches visualized in pixel space. (Bottom) Matches visualized in a top-down view. We see that the DynaMo representation captures task-relevant features (end effector, block, and target locations in this case), whereas prior work fixates on the large robot arm.
  • Figure 3: Architecture of DynaMo. DynaMo jointly learns an image encoder, an inverse dynamics model, and a forward dynamics model with a forward dynamics prediction loss.
  • Figure 4: We evaluate DynaMo on four simulated benchmarks (Franka Kitchen, Block Pushing, Push-T, and LIBERO Goal) and two real-world environments (Allegro Manipulation and xArm Kitchen).
  • Figure 5: xArm Kitchen environment tasks
  • ...and 2 more figures