DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control

Zichen Jeff Cui; Hengkai Pan; Aadhithya Iyer; Siddhant Haldar; Lerrel Pinto

TL;DR

DynaMo is a new in-domain, self-supervised method for learning visual representations. Representations learned with DynaMo significantly improve downstream imitation learning performance over prior self-supervised learning objectives and pretrained representations.

Abstract

Imitation learning has proven to be a powerful tool for training complex visuomotor policies. However, current methods often require hundreds to thousands of expert demonstrations to handle high-dimensional visual observations. A key reason for this poor data efficiency is that visual representations are predominantly either pretrained on out-of-domain data or trained directly through a behavior cloning objective. In this work, we present DynaMo, a new in-domain, self-supervised method for learning visual representations. Given a set of expert demonstrations, we jointly learn a latent inverse dynamics model and a forward dynamics model over a sequence of image embeddings, predicting the next frame in latent space, without augmentations, contrastive sampling, or access to ground truth actions. Importantly, DynaMo does not require any out-of-domain data such as Internet datasets or cross-embodied datasets. On a suite of six simulated and real environments, we show that representations learned with DynaMo significantly improve downstream imitation learning performance over prior self-supervised learning objectives and pretrained representations. Gains from using DynaMo hold across policy classes such as Behavior Transformer, Diffusion Policy, MLP, and nearest neighbors. Finally, we ablate over key components of DynaMo and measure their impact on downstream policy performance. Robot videos are best viewed at https://dynamo-ssl.github.io.
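
The objective described in the abstract can be illustrated with a minimal NumPy sketch. It only computes the next-frame prediction loss in latent space for one demonstration; everything concrete here is an assumption made for illustration and not the paper's implementation: the linear encoder, the linear inverse and forward models, the mean squared error, the dimensions, and the names `init_linear` and `dynamics_loss`.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_linear(in_dim, out_dim):
    # Hypothetical linear layer without bias (stand-in for a real network).
    return rng.normal(scale=0.1, size=(in_dim, out_dim))

def dynamics_loss(frames, enc_w, inv_w, fwd_w):
    """Next-frame prediction loss in latent space for one demonstration.

    frames: (T, D) array of flattened frames.
    enc_w:  (D, E) encoder weights.
    inv_w:  (2E, A) inverse dynamics weights, (z_t, z_{t+1}) -> latent action.
    fwd_w:  (E + A, E) forward dynamics weights, (z_t, action) -> z_{t+1}.
    """
    z = frames @ enc_w                       # embed every frame: (T, E)
    z_t, z_next = z[:-1], z[1:]              # consecutive embedding pairs
    # Inverse dynamics: infer a latent action from each transition.
    # No ground truth actions are used anywhere.
    latent_action = np.concatenate([z_t, z_next], axis=1) @ inv_w
    # Forward dynamics: predict the next embedding from (z_t, latent action).
    z_pred = np.concatenate([z_t, latent_action], axis=1) @ fwd_w
    # Assumed loss: mean squared error between predicted and actual embeddings.
    return float(np.mean((z_pred - z_next) ** 2))

# Toy demonstration: T frames of D pixels, E-dim embeddings, A-dim latent actions.
T, D, E, A = 8, 48, 16, 4
frames = rng.normal(size=(T, D))
enc_w = init_linear(D, E)
inv_w = init_linear(2 * E, A)
fwd_w = init_linear(E + A, E)

loss = dynamics_loss(frames, enc_w, inv_w, fwd_w)
# An all-zero encoder makes every embedding identical, so this loss is zero.
collapsed_loss = dynamics_loss(frames, np.zeros_like(enc_w), inv_w, fwd_w)
```

In this simplified form, an encoder that maps every frame to the same embedding drives the loss to zero, as `collapsed_loss` shows. A practical objective therefore needs some safeguard against that trivial solution, which the sketch omits.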

Paper Structure (35 sections, 1 equation, 7 figures, 18 tables)

Introduction
Background
Visual imitation learning
Visual pretraining for policy learning
DynaMo
Limitations of prior self-supervised techniques
Overview of DynaMo
Dynamics as a visual self-supervised learning objective
Experiments
Environments and datasets
Does DynaMo improve downstream policy performance?
Do representations trained with DynaMo work on real robotic tasks?
Is DynaMo compatible with different policy classes?
Can pretrained weights be fine-tuned in domain with DynaMo?
How important is each component in DynaMo?
...and 20 more sections

Figures (7)

Figure 1: (a) We present DynaMo, a new self-supervised method for learning visual representations for visuomotor control. DynaMo exploits the causal structure in demonstrations by jointly learning the encoder with inverse and forward dynamics models. DynaMo requires no augmentations, contrastive sampling, or access to ground truth actions. This enables downstream policy learning using limited in-domain data across simulated and real-world robotics tasks. For each environment, we pretrain the visual representation in-domain with DynaMo and learn a policy on the pretrained embeddings. (b) We provide real-world rollouts of policies learned with DynaMo representations on our multi-task xArm Kitchen and Allegro Manipulation environments.
Figure 2: Embedding nearest neighbor matches for DynaMo, BYOL, MoCo, and TCN on the Block Pushing environment. (Top) The nearest neighbor matches visualized in pixel space. (Bottom) Matches visualized in a top-down view. We see that the DynaMo representation captures task-relevant features (end effector, block, and target locations in this case), whereas prior work fixates on the large robot arm.
Figure 3: Architecture of DynaMo. DynaMo jointly learns an image encoder, an inverse dynamics model, and a forward dynamics model with a forward dynamics prediction loss.
Figure 4: We evaluate DynaMo on four simulated benchmarks (Franka Kitchen, Block Pushing, Push-T, and LIBERO Goal) and two real-world environments (Allegro Manipulation and xArm Kitchen).
Figure 5: xArm Kitchen environment tasks
...and 2 more figures
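
Figure 2 compares representations through nearest neighbor matches in embedding space. A minimal sketch of such a lookup follows; the squared Euclidean distance, the toy two-dimensional embeddings, and the function name `nearest_neighbor_indices` are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def nearest_neighbor_indices(queries, bank):
    """Return, for each query embedding, the index of its closest bank embedding.

    queries: (Q, E) array of query embeddings.
    bank:    (N, E) array of reference embeddings (e.g. from demonstration frames).
    """
    # Pairwise squared Euclidean distances via broadcasting: (Q, N).
    diffs = queries[:, None, :] - bank[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    return np.argmin(sq_dists, axis=1)

# Toy embeddings chosen so the closest match for each query is unambiguous.
bank = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
queries = np.array([[0.9, 0.1], [0.1, 1.8]])
matches = nearest_neighbor_indices(queries, bank)  # -> array([1, 2])
```

The same lookup could serve as a simple nearest neighbor policy by returning the action stored alongside the matched demonstration frame.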
