AMPLIFY: Actionless Motion Priors for Robot Learning from Videos

Jeremy A. Collins, Loránd Cheng, Kunal Aneja, Albert Wilcox, Benjamin Joffe, Animesh Garg

TL;DR

AMPLIFY tackles the data bottleneck in robotics by decoupling dynamics learning from policy execution and leveraging abundant action-free video data. The method learns a latent keypoint motion representation via finite scalar quantization (FSQ), trains a forward dynamics model on videos, and learns an inverse dynamics policy from limited action data, enabling data-efficient policy learning and zero-shot generalization. Empirical results show more accurate forward dynamics, better policy performance in low-data regimes, and transfer across embodiments, with additional benefits in conditional video generation. This work demonstrates a versatile framework for combining heterogeneous data sources to build scalable, generalizable world models for visuomotor tasks.

Abstract

Action-labeled data for robotics is scarce and expensive, limiting the generalization of learned policies. In contrast, vast amounts of action-free video data are readily available, but translating these observations into effective policies remains a challenge. We introduce AMPLIFY, a novel framework that leverages large-scale video data by encoding visual dynamics into compact, discrete motion tokens derived from keypoint trajectories. Our modular approach separates visual motion prediction from action inference, decoupling the challenges of learning what motion defines a task from how robots can perform it. We train a forward dynamics model on abundant action-free videos and an inverse dynamics model on a limited set of action-labeled examples, allowing for independent scaling. Extensive evaluations demonstrate that the learned dynamics are both accurate, achieving up to 3.7x better MSE and over 2.5x better pixel prediction accuracy compared to prior approaches, and broadly useful. In downstream policy learning, our dynamics predictions enable a 1.2-2.2x improvement in low-data regimes, a 1.4x average improvement by learning from action-free human videos, and the first generalization to LIBERO tasks from zero in-distribution action data. Beyond robotic control, we find the dynamics learned by AMPLIFY to be a versatile latent world model, enhancing video prediction quality. Our results present a novel paradigm leveraging heterogeneous data sources to build efficient, generalizable world models. More information can be found at https://amplify-robotics.github.io/.
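To make the "compact, discrete motion tokens" concrete, here is a minimal sketch of finite scalar quantization (FSQ), the quantizer named in the TL;DR. It is an illustration under stated assumptions, not the authors' implementation: the level counts `[5, 5, 3]` are hypothetical, only odd level counts are handled, and the straight-through gradient used when training such a quantizer is omitted. The function names `fsq_quantize` and `fsq_code_index` are made up for this example.

```python
import math


def fsq_quantize(z, levels):
    """Round each latent dimension to one of `levels[i]` integer values.

    Assumption: every level count is odd, so the grid is symmetric around 0.
    """
    quantized = []
    for value, n_levels in zip(z, levels):
        if n_levels % 2 == 0:
            raise ValueError("this sketch only handles odd level counts")
        half = (n_levels - 1) // 2
        # tanh bounds the value to (-1, 1); scaling maps it to (-half, half).
        quantized.append(round(math.tanh(value) * half))
    return quantized


def fsq_code_index(quantized, levels):
    """Map a quantized vector to a single token id (mixed-radix encoding)."""
    index, basis = 0, 1
    for value, n_levels in zip(quantized, levels):
        half = (n_levels - 1) // 2
        index += (value + half) * basis  # shift to 0..n_levels-1
        basis *= n_levels
    return index


levels = [5, 5, 3]  # hypothetical; implies a codebook of 5 * 5 * 3 = 75 tokens
q = fsq_quantize([10.0, -10.0, 10.0], levels)
token = fsq_code_index(q, levels)
```

The codebook here is implicit: its size is the product of the level counts, so no embedding table has to be learned for the quantizer itself.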

Paper Structure

This paper contains 45 sections, 4 equations, 10 figures, 16 tables, 1 algorithm.

Figures (10)

  • Figure 1: Architecture. AMPLIFY consists of a three-stage decomposition: (a) keypoint tracks are compressed into a discrete latent space using FSQ; for each timestep and each point, the decoder outputs a distribution over a local window centered on that point to reconstruct the instantaneous velocities; (b) a forward dynamics model is trained to predict the latent codes for the next $T$ timesteps given an input image and task description; and (c) an inverse dynamics model decodes predicted track tokens into an action chunk.
  • Figure 2: Decoded keypoint trajectory predictions from AMPLIFY. Zero-movement points are not shown.
  • Figure 3: LIBERO few-shot. Comparison of AMPLIFY against ATM (Wen et al., 2024) and a no-video-pre-training baseline. Our forward model is trained on all videos, and the inverse model is only trained on a limited number of demos.
  • Figure 4: A sample of the 130 diverse tasks and environment configurations in LIBERO.
  • Figure 5: We use three static RGB cameras as input observations for both human and robot (UR5) data.
  • ...and 5 more figures
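
The Figure 1 caption says the decoder outputs a distribution over a local window centered on each point to reconstruct instantaneous velocities. One way to read this is as classification over displacement cells. The sketch below illustrates that reading only; the window size, the row-major cell layout, integer pixel offsets, and all function names are assumptions for this example, not details taken from the paper.

```python
def velocity_to_window_index(dx, dy, window):
    """Encode an integer (dx, dy) displacement as a flat class index.

    Assumption: `window` is odd and cells are laid out row-major,
    with the centre cell meaning "no movement".
    """
    half = window // 2
    if window % 2 == 0 or abs(dx) > half or abs(dy) > half:
        raise ValueError("displacement must fit inside an odd-sized window")
    return (dy + half) * window + (dx + half)


def window_index_to_velocity(index, window):
    """Inverse of velocity_to_window_index."""
    half = window // 2
    row, col = divmod(index, window)
    return (col - half, row - half)


def decode_velocity(logits, window):
    """Pick the most likely cell and return its (dx, dy) displacement."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return window_index_to_velocity(best, window)


window = 5                       # hypothetical 5 x 5 window
logits = [0.0] * (window * window)
logits[velocity_to_window_index(2, -1, window)] = 3.0
velocity = decode_velocity(logits, window)
```

Framing velocity reconstruction as classification over a small window gives the decoder a bounded output space, which is one plausible motivation for the design described in the caption.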