Latent Policy Steering with Embodiment-Agnostic Pretrained World Models

Yiqi Wang, Mrinal Verghese, Jeff Schneider

TL;DR

The paper tackles data efficiency in visuomotor imitation across diverse robot embodiments by pretraining an embodiment-agnostic World Model (WM) using optical-flow actions derived from cross-embodiment data (robots and humans). It then finetunes the WM on a small target-embodiment dataset and introduces Latent Policy Steering (LPS), a robust value-function-based method that guides a behavior-cloned policy toward states that are similar to the training data and yield higher rewards. Empirical results in both simulation (Robomimic) and real-world setups show substantial improvements in low-data scenarios, with further gains from including human play data in pretraining. The work demonstrates strong cross-embodiment transfer, reduces data collection needs, and offers a scalable approach for data-efficient robotic policy learning.

Abstract

Learning visuomotor policies via imitation has proven effective across a wide range of robotic domains. However, the performance of these policies is heavily dependent on the number of training demonstrations, which requires expensive data collection in the real world. In this work, we aim to reduce data collection efforts when learning visuomotor robot policies by leveraging existing or cost-effective data from a wide range of embodiments, such as public robot datasets and datasets of humans playing with objects (human data from play). Our approach leverages two key insights. First, we use optic flow as an embodiment-agnostic action representation to train a World Model (WM) across multi-embodiment datasets, and finetune it on a small amount of robot data from the target embodiment. Second, we develop a method, Latent Policy Steering (LPS), to improve the output of a behavior-cloned policy by searching in the latent space of the WM for better action sequences. In real-world experiments, we observe significant improvements in the performance of policies trained with a small amount of data (over 50% relative improvement with 30 demonstrations and over 20% relative improvement with 50 demonstrations) by combining the policy with a WM pretrained on two thousand episodes sampled from the existing Open X-Embodiment dataset across different robots or a cost-effective human dataset from play.
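
The second insight, searching the WM's latent space for better action sequences, can be pictured as a sample-and-rank loop. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: `toy_policy`, `toy_world_model`, and `toy_value` are hypothetical stand-ins operating on a 1-D latent state, and the selection rule (keep the candidate whose final predicted latent has the highest value) is an assumption made for illustration.

```python
import random

def toy_policy(latent, horizon, rng):
    """Hypothetical stand-in for a behavior-cloned policy: samples one action sequence."""
    return [rng.uniform(-1.0, 1.0) for _ in range(horizon)]

def toy_world_model(latent, action):
    """Hypothetical stand-in for the WM's latent dynamics: next latent given an action."""
    return latent + 0.1 * action

def toy_value(latent, goal=1.0):
    """Hypothetical stand-in for a value function: higher when the latent is near the goal."""
    return -abs(goal - latent)

def steer(latent, num_candidates=16, horizon=5, seed=0):
    """Sample candidate action sequences, roll each out in latent space, keep the best."""
    rng = random.Random(seed)
    best_actions, best_score = None, float("-inf")
    for _ in range(num_candidates):
        actions = toy_policy(latent, horizon, rng)
        z = latent
        for a in actions:
            z = toy_world_model(z, a)  # imagined rollout, no real environment step
        score = toy_value(z)
        if score > best_score:
            best_actions, best_score = actions, score
    return best_actions, best_score

if __name__ == "__main__":
    actions, score = steer(latent=0.0)
    print(len(actions), score)
```

Because every candidate comes from the policy itself, the loop can only re-rank behaviors the policy already proposes; the world model and value function decide which proposal is executed.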

Paper Structure

This paper contains 22 sections, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: World Model (WM) pretraining with optical flow as an embodiment-agnostic action representation. Thanks to optical flow, we can integrate data from multiple types of embodiments (robots, humans), with a shared action space. A WM is chosen for pretraining since it can leverage suboptimal data. We encode every optical flow to a vector input of the WM, and train the encoder end-to-end with the WM.
  • Figure 2: Overview of the method. The figure illustrates all the phases of our method, including WM pretraining, finetuning, and Latent Policy Steering (LPS). After pretraining the WM with optic flow actions, the second phase involves collecting a small robot dataset to learn a policy from scratch, finetuning the WM, and learning a robust value function based on the WM. During inference, the WM and value function are used to select the best candidate.
  • Figure 3: Optical flow as an embodiment-agnostic action representation. We observe that motions captured by optical flow across embodiments are similar in visual space. By using optical flow as an action representation, we remove WM's dependency on specific embodiments, allowing the pretrained model to efficiently leverage data from multiple embodiments.
  • Figure 4: Real world environment. We conduct the real-world experiment in a table-top setting with a Franka robot, given a set of common objects.
  • Figure 5: Effects of the pretraining embodiments. World Models (WM) are pretrained on data with different embodiments (i.e., recipes). They are finetuned on the same robot dataset and combined with the same policy to solve the same task. We observe a promising trend: our approach can effectively leverage data across multiple embodiments. As the number of pretraining embodiments increases, the amount of pretraining data grows, which results in better performance compared to the policy-only baseline (BC) and the WM without pretraining (i.e., LPS-scratch). Surprisingly, human video data leads to a competitive WM checkpoint (green), compared to the recipe with more embodiments and more data.
  • ...and 1 more figure
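
The Figure 1 caption describes encoding each optical flow into a vector that the WM takes as its action input. As a rough picture of what "flow as an action vector" can look like, the sketch below average-pools a dense flow field over a coarse grid and flattens the result. This is a fixed, hand-written stand-in for illustration only: the caption states that the actual encoder is trained end-to-end with the WM, and `pool_flow` and its `grid` parameter are hypothetical names.

```python
def pool_flow(flow, grid=2):
    """Average-pool a dense flow field into a flat vector.

    flow: H x W nested list of (dx, dy) tuples; H and W must be divisible by grid.
    Returns a list of length grid * grid * 2.
    """
    height, width = len(flow), len(flow[0])
    if height % grid or width % grid:
        raise ValueError("flow dimensions must be divisible by grid")
    cell_h, cell_w = height // grid, width // grid
    count = cell_h * cell_w
    vector = []
    for gy in range(grid):
        for gx in range(grid):
            sum_dx = sum_dy = 0.0
            for y in range(gy * cell_h, (gy + 1) * cell_h):
                for x in range(gx * cell_w, (gx + 1) * cell_w):
                    dx, dy = flow[y][x]
                    sum_dx += dx
                    sum_dy += dy
            # Mean motion of this grid cell, appended as (dx, dy).
            vector.extend([sum_dx / count, sum_dy / count])
    return vector

# A 4x4 field in which every pixel moves one pixel to the right.
uniform_right = [[(1.0, 0.0)] * 4 for _ in range(4)]
action_vector = pool_flow(uniform_right)
```

The vector depends only on pixel motion, so the same function applies whether the moving object is a robot gripper or a human hand, which is the property the Figure 3 caption highlights.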