What Do Latent Action Models Actually Learn?

Chuheng Zhang, Tim Pearce, Pushi Zhang, Kaixin Wang, Xiaoyu Chen, Wei Shen, Li Zhao, Jiang Bian

TL;DR

The paper analyzes Latent Action Models by introducing a tractable linear abstraction and showing that, under reasonable assumptions, learning reduces to PCA on the sum of controllable changes and exogenous noise. It then elucidates how data collection policy and noise structure affect the learned latent, and demonstrates practical remedies—data augmentation and auxiliary action prediction—to improve alignment with true actions. The authors validate their theoretical insights with nonlinear experiments on a small grid-world, offering actionable guidance for designing LAM datasets and training procedures to ensure latent semantics reflect controllable changes rather than noise. Overall, the work provides a principled lens on LAM learnability and concrete strategies to enhance their reliability in unsupervised pretraining for embodied AI.

Abstract

Latent action models (LAMs) aim to learn action-relevant changes from unlabeled videos by compressing changes between frames as latents. However, differences between video frames can be caused by controllable changes as well as exogenous noise, leading to an important concern -- do latents capture the changes caused by actions or irrelevant noise? This paper studies this issue analytically, presenting a linear model that encapsulates the essence of LAM learning, while being tractable. This provides several insights, including connections between LAM and principal component analysis (PCA), desiderata of the data-generating policy, and justification of strategies to encourage learning controllable changes using data augmentation, data cleaning, and auxiliary action-prediction. We also provide illustrative results based on numerical simulation, shedding light on the specific structure of observations, actions, and noise in data that influence LAM learning.

Paper Structure

This paper contains 18 sections, 4 theorems, 15 equations, 7 figures, 1 table.

Key Result

Proposition 4.1

Under the linear LAM model and setup defined in Section \ref{sec:setup_linearlam}, and additionally assuming $\mathbb{E}[\boldsymbol{o}(\boldsymbol{q}+\boldsymbol{\epsilon})^T] = \mathbf{0}$, the objective of linear LAM is equivalent to performing PCA on a mixture of controllable changes $\boldsymbol{q}$ and exogenous noise $\boldsymbol{\epsilon}$.
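The PCA connection can be checked numerically. The sketch below is a minimal illustration, not the authors' code: it assumes the frame difference is the sum $\boldsymbol{q}+\boldsymbol{\epsilon}$ with $\boldsymbol{q} = X\boldsymbol{a}$ for a hypothetical action-effect matrix `X`, and all dimensions and noise scales are made up for the example. It projects the difference onto its top $d_z$ principal directions and confirms that the residual error equals the sum of the discarded eigenvalues, which is the standard PCA reconstruction property.

```python
import numpy as np

def pca_bottleneck(delta, d_z):
    """Project delta onto its top-d_z principal directions.

    Returns the directions U (d_o x d_z), the mean squared residual,
    and all covariance eigenvalues sorted in descending order.
    """
    delta = delta - delta.mean(axis=0)
    cov = delta.T @ delta / len(delta)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    U = eigvecs[:, :d_z]
    residual = delta - delta @ U @ U.T
    mse = np.mean(np.sum(residual ** 2, axis=1))
    return U, mse, eigvals

rng = np.random.default_rng(0)
n, d_o, d_a, d_z = 20000, 16, 4, 4               # illustrative sizes (assumed)

# Hypothetical action-effect matrix with orthonormal columns.
X = np.linalg.qr(rng.normal(size=(d_o, d_a)))[0]
a = rng.normal(size=(n, d_a))                    # unit-variance actions
q = a @ X.T                                      # controllable changes
eps = 0.1 * rng.normal(size=(n, d_o))            # small i.i.d. exogenous noise

U, mse, eigvals = pca_bottleneck(q + eps, d_z)

# Residual error vs. sum of discarded eigenvalues (these should agree).
print(mse, eigvals[d_z:].sum())
# Overlap between the learned subspace and the action subspace (at most d_a).
print(np.linalg.norm(X.T @ U) ** 2)
```

With noise this small, the top directions should lie almost entirely inside the column space of `X`; raising the noise scale changes which directions win, since PCA ranks directions by variance alone.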

Figures (7)

  • Figure 1: Linear LAM is an abstraction of the LAMs used in previous work. Given consecutive observation pairs $(\boldsymbol{o}, \boldsymbol{o}')$, a LAM is trained to reconstruct the second observation under the loss $\lVert \hat{\boldsymbol{o}}' - \boldsymbol{o}' \rVert_2^2$. An information bottleneck discourages direct copying of $\boldsymbol{o}'$, in the expectation that the latent $\boldsymbol{z}$ will correspond to the control action $\mathbf{a}$. Linear LAM captures the essence of LAM training whilst being analytically tractable. The diagrams of previous LAMs are copied from their original papers: LAPO schmidt2023lapo, LAPA ye2024lapa, Moto chen2024moto, Genie bruce2024genie, AdaWorld gao2025adaworld, and Go-1 agibot2025aigbot.
  • Figure 2: Overview of linear LAM. Grey blocks represent learnable parameter matrices, giving rise to the predictive model $\hat{\boldsymbol{o}}' = A\boldsymbol{o} + B(C\boldsymbol{o} + D\boldsymbol{o}')$. Green illustrates linear LAM with data augmentation to reduce the amount of information the latent contains about the observation (in Section \ref{sec:analysis3}). Pink illustrates linear LAM with auxiliary action prediction to encourage the latent to focus on the controllable actions and suppress the noise signal (in Section \ref{sec:analysis4}).
  • Figure 3: LLO (linear LAM objective, \ref{eq:linear_lam_obj}; higher is better) measured in three noise settings. (Left) $\boldsymbol{\epsilon} = 0$. (Middle) $\boldsymbol{\epsilon}$ is i.i.d. noise. (Right) $\boldsymbol{\epsilon}$ contains the effect of other agents. Action MSE, noise MSE, and observation MSE are the three terms in \ref{eq:linear_lam_obj}. We set real action dimension $d_a=8$ and exogenous action dimension $d_b=8$ unless otherwise stated, and ensure $\boldsymbol{q}$ has unit variance.
  • Figure 4: Visualization of the technical results in Section \ref{sec:analysis1}. When the correlations between $\boldsymbol{o}$, $\boldsymbol{q}$, and $\boldsymbol{\epsilon}$ are assumed to be zero, variance decomposes additively. As with PCA, linear LAM's latent indiscriminately captures the $d_z$ dimensions with the largest variance (i.e., the top dimensions in the figure), regardless of the source of this variance.
  • Figure 5: Intuitive examples of the noise cases defined in Section \ref{sec:analysis1}. Images are taken from the RT-1 (the first two examples) and Ego4d (the last example) datasets. Differences in subsequent frames in the RT-1 dataset are dominated by the controllable actions, while the camera shake in Ego4d leads to large visual changes not related to the actions of interest (here hand manipulation).
  • ...and 2 more figures
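
The predictive model in Figure 2 and the variance picture in Figure 4 can be combined in a small numerical sketch. It is illustrative only and makes several assumptions not stated in the captions: observations evolve as $\boldsymbol{o}' = \boldsymbol{o} + \boldsymbol{q} + \boldsymbol{\epsilon}$, the controllable and exogenous changes live in disjoint coordinate directions (the hypothetical matrices `X` and `Y`), and the parameters $A, B, C, D$ are set by hand to one PCA-style solution ($A = I$, $B = U$, $C = -U^T$, $D = U^T$) rather than learned by gradient descent.

```python
import numpy as np

def linear_lam_predict(o, o_next, A, B, C, D):
    """Forward pass of the Figure 2 model: o_hat' = A o + B (C o + D o')."""
    z = o @ C.T + o_next @ D.T                  # latent, shape (n, d_z)
    return o @ A.T + z @ B.T, z

def top_directions(delta, d_z):
    """Top-d_z eigenvectors of the covariance of delta (columns of the result)."""
    cov = np.cov(delta, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)            # ascending eigenvalue order
    return eigvecs[:, ::-1][:, :d_z]

rng = np.random.default_rng(0)
n, d_o, d_z = 20000, 12, 2                      # illustrative sizes (assumed)
basis = np.eye(d_o)
X = basis[:, :2]                                # hypothetical controllable directions
Y = basis[:, 2:4]                               # hypothetical exogenous directions

o = rng.normal(size=(n, d_o))
q = rng.normal(size=(n, 2)) @ X.T               # unit-variance controllable change

results = {}
for noise_scale in (0.3, 3.0):
    eps = noise_scale * rng.normal(size=(n, 2)) @ Y.T
    o_next = o + q + eps
    U = top_directions(o_next - o, d_z)
    # Hand-set parameters so that z = U^T (o' - o) and o_hat' = o + U z.
    A, B, C, D = np.eye(d_o), U, -U.T, U.T
    o_hat, z = linear_lam_predict(o, o_next, A, B, C, D)
    mse = np.mean(np.sum((o_hat - o_next) ** 2, axis=1))
    # Fraction of the latent subspace that lies in the action directions.
    action_share = np.linalg.norm(X.T @ U) ** 2 / d_z
    results[noise_scale] = (mse, action_share)

for noise_scale, (mse, action_share) in results.items():
    print(noise_scale, mse, action_share)
```

In this toy setup the latent should align with the action directions when the exogenous variance is below the action variance, and switch to the noise directions once the exogenous variance dominates, which mirrors the caption's point that the bottleneck keeps the highest-variance dimensions regardless of their source.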

Theorems & Definitions (5)

  • Proposition 4.1: Linear LAM is PCA
  • Proposition 4.2: Linear LAM tries to capture $\boldsymbol{q}+\boldsymbol{\epsilon}$
  • Proposition 4.3: Data augmentation addresses over-parameterization
  • Proposition 4.4: Action prediction can denoise
  • proof