Learning to Act without Actions

Dominik Schmidt, Minqi Jiang

TL;DR

The paper addresses the lack of action labels in web-scale videos, which has kept them out of RL pretraining, by introducing Latent Action Policies (LAPO). LAPO infers latent actions with an inverse dynamics model (IDM) whose continuous output passes through a vector-quantized bottleneck to a forward dynamics model (FDM) that must predict the next observation. A latent-action policy is then trained via behavior cloning on the inferred latent actions and decoded to true actions either offline with a small action-labeled dataset or online via RL fine-tuning, reaching expert-level performance on Procgen with far fewer environment interactions than PPO trained from scratch. Key contributions include demonstrating that the latent actions align with the true action space without ground-truth labels, and that latent-action policies can be rapidly adapted online or offline. Together these results suggest a path toward web-scale unsupervised pretraining of generalist RL agents and world models from action-free video.

Abstract

Pre-training large models on vast amounts of web data has proven to be an effective approach for obtaining powerful, general models in domains such as language and vision. However, this paradigm has not yet taken hold in reinforcement learning. This is because videos, the most abundant form of embodied behavioral data on the web, lack the action labels required by existing methods for imitating behavior from demonstrations. We introduce Latent Action Policies (LAPO), a method for recovering latent action information, and thereby latent-action policies, world models, and inverse dynamics models, purely from videos. LAPO is the first method able to recover the structure of the true action space just from observed dynamics, even in challenging procedurally-generated environments. LAPO enables training latent-action policies that can be rapidly fine-tuned into expert-level policies, either offline using a small action-labeled dataset, or online with rewards. LAPO takes a first step towards pre-training powerful, generalist policies and world models on the vast amounts of videos readily available on the web.
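
The offline route mentioned above, turning latent actions into true actions with a small action-labeled dataset, can be illustrated with a deliberately simple stand-in. The sketch below assumes discrete latent codes and uses a majority-vote lookup table rather than a learned decoder; `fit_decoder`, `act`, the toy labeled pairs, and the toy `latent_policy` are all hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def fit_decoder(labeled_pairs):
    """Map each latent code to the true action it most often co-occurs with.

    labeled_pairs: iterable of (latent_code, true_action) tuples from a small
    action-labeled dataset. Majority voting is an assumption made for this
    illustration, not the paper's decoder.
    """
    votes = defaultdict(Counter)
    for latent_code, true_action in labeled_pairs:
        votes[latent_code][true_action] += 1
    return {code: counts.most_common(1)[0][0] for code, counts in votes.items()}


def act(latent_policy, decoder, observation, default_action="NOOP"):
    """Run the latent policy, then decode its latent code to a true action."""
    latent_code = latent_policy(observation)
    # Codes never seen in the labeled data fall back to a default action.
    return decoder.get(latent_code, default_action)


# Toy labeled data: (latent code, true action). Purely illustrative.
labeled = [(0, "LEFT"), (0, "LEFT"), (0, "UP"), (1, "RIGHT"), (2, "UP")]
decoder = fit_decoder(labeled)


def latent_policy(observation):
    """Hypothetical stand-in for a behavior-cloned latent-action policy."""
    return observation % 3


print(act(latent_policy, decoder, 4))  # prints "RIGHT"
```

The lookup table only works if the latent codes already separate the true actions well, which is the property the paper reports for its learned latent space.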

Paper Structure (26 sections, 2 equations, 9 figures, 1 table)

Figures (9)

  • Figure 1: UMAP projection of the learned latent action space for Miner alongside illustrative next-state predictions generated by the FDM for each cluster of latent actions. Each point represents a continuous latent action generated by the IDM for a transition in the video dataset. Each point is color-coded by the true action taken by the agent at that transition. For clarity, NOOP actions are omitted. The structure of the latent action space is highly interpretable and closely corresponds to the true action space, even though no ground-truth action labels were used during training.
  • Figure 2: LAPO architecture. Both IDM and FDM observe $o_t$, but only the IDM observes $o_{t+1}$. To enable accurate predictions of $o_{t+1}$, the IDM must pass useful transition information through the quantized information bottleneck $z_t$ to the FDM.
  • Figure 3: Left: Mean episodic returns (over the course of training) for decoding LAPO's latent policy and PPO from scratch (averaged across 3 seeds). Right: Mean test returns relative to per-environment expert policies averaged across all 16 Procgen environments. Error bars indicate standard deviation across seeds.
  • Figure 4: Offline decoding performance vs. number of labeled transitions (mean and standard deviation across 3 seeds).
  • Figure 5: UMAP projection of the learned latent action space for all 16 Procgen games. Each point represents the continuous (pre-quantization) latent action generated by the IDM for a transition in the observation-only dataset. Each point is color-coded by the true action taken by the agent at that transition (true action labels are used only for visualization, not for training). Arrows in the legend correspond to movement directions. NOOP actions are omitted for clarity.
  • ...and 4 more figures
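
The IDM/FDM arrangement described in the Figure 2 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear maps, the dimensions, the random codebook, and the names `linear`, `quantize`, and `lapo_forward` are hypothetical, a squared-error prediction loss is assumed, and training is omitted entirely.

```python
import random


def linear(weights, x):
    """Apply a weight matrix (a list of rows) to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def quantize(z, codebook):
    """Vector-quantized bottleneck: snap z to its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    index = min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
    return index, codebook[index]


def lapo_forward(o_t, o_next, idm_w, fdm_w, codebook):
    """One forward pass of the IDM -> bottleneck -> FDM loop.

    The IDM sees both o_t and o_next; the FDM sees only o_t and z_t, so any
    information about the transition must pass through the bottleneck.
    """
    z_cont = linear(idm_w, o_t + o_next)     # continuous latent action
    index, z_t = quantize(z_cont, codebook)  # quantized latent action z_t
    o_pred = linear(fdm_w, o_t + z_t)        # FDM prediction of o_{t+1}
    # Assumed loss: mean squared error between prediction and true o_{t+1}.
    loss = sum((p - q) ** 2 for p, q in zip(o_pred, o_next)) / len(o_next)
    return z_cont, index, o_pred, loss


rng = random.Random(0)
OBS_DIM, LATENT_DIM, NUM_CODES = 4, 2, 8  # hypothetical toy sizes


def rand_matrix(rows, cols):
    """Random matrix standing in for learned parameters."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


idm_w = rand_matrix(LATENT_DIM, 2 * OBS_DIM)
fdm_w = rand_matrix(OBS_DIM, OBS_DIM + LATENT_DIM)
codebook = rand_matrix(NUM_CODES, LATENT_DIM)

o_t = [rng.uniform(-1, 1) for _ in range(OBS_DIM)]
o_next = [rng.uniform(-1, 1) for _ in range(OBS_DIM)]
z_cont, index, o_pred, loss = lapo_forward(o_t, o_next, idm_w, fdm_w, codebook)
print(index, round(loss, 4))
```

Here `z_cont` plays the role of the continuous (pre-quantization) latent action that the UMAP figures visualize, while `z_t` is what the FDM actually receives.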