
Learning Additively Compositional Latent Actions for Embodied AI

Hangxing Wei, Xiaoyu Chen, Chuheng Zhang, Tim Pearce, Jianyu Chen, Alex Lamb, Li Zhao, Jiang Bian

Abstract

Latent action learning infers pseudo-action labels from visual transitions, providing an approach to leverage internet-scale video for embodied AI. However, most methods learn latent actions without structural priors that encode the additive, compositional structure of physical motion. As a result, latents often entangle irrelevant scene details or information about future observations with true state changes and miscalibrate motion magnitude. We introduce Additively Compositional Latent Action Model (AC-LAM), which enforces scene-wise additive composition structure over short horizons on the latent action space. These AC constraints encourage simple algebraic structure in the latent action space~(identity, inverse, cycle consistency) and suppress information that does not compose additively. Empirically, AC-LAM learns more structured, motion-specific, and displacement-calibrated latent actions and provides stronger supervision for downstream policy learning, outperforming state-of-the-art LAMs across simulated and real-world tabletop tasks.

Paper Structure

This paper contains 46 sections, 5 theorems, 15 equations, 6 figures, 6 tables.

Key Result

Proposition 3.2

It holds that $z_{ii}=0$.
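The identity property above, together with the inverse and cycle consistencies listed in the propositions below, follows directly from the scene-wise additivity constraint $z_{ik}=z_{ij}+z_{jk}$ stated in the abstract. A short derivation sketch (specializing the indices of the additivity constraint; this reconstructs the algebra informally, not the paper's full proofs):

```latex
\begin{align}
  z_{ii} &= z_{ii} + z_{ii} &&\Rightarrow\; z_{ii} = 0
    &&\text{(identity; set } i=j=k\text{)}\\
  0 = z_{ii} &= z_{ij} + z_{ji} &&\Rightarrow\; z_{ji} = -z_{ij}
    &&\text{(inverse; set } k=i\text{)}\\
  z_{ij} + z_{jk} + z_{ki} &= z_{ik} + z_{ki} = z_{ii} = 0
    && &&\text{(cycle consistency)}
\end{align}
```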

Figures (6)

  • Figure 1: Evolution of the normalized latent action norm $\|LAM(o_0, o_t)\|$ over time intervals $t$. The figure shows that, compared with baselines, our AC constraints (AC-LAM) induce displacement-calibrated latent actions, effectively capturing the magnitude of the transition from $o_0$ to $o_t$.
  • Figure 2: Additively Compositional Latent Action Model (AC-LAM). For triples $(o_i,o_j,o_k)$ from the same scene, scene-wise additivity encourages $z_{ik}\approx z_{ij}+z_{jk}$, which regularizes the latent action space on top of a standard IDM–FDM architecture. The red line denotes the $(i,j)$ mapping with IDM encoder $z_{ij} = f(o_i, o_j)$ and FDM decoder $\hat{o}_j=F_{z_{ij}}(o_i)$. The blue and green lines depict the corresponding mappings for $(j,k)$ and $(i,k)$, respectively.
  • Figure 3: Trajectory of the latent action norm $\|f(o_0,o_t)\|$ in real-world tabletop manipulation, with latent actions generated by LAPA LAM, UniVLA LAM, Villa-X LAM, and AC-LAM. AC-LAM yields the most displacement-calibrated latents, aligning with motion magnitude.
  • Figure 4: Two experimental environments: (a) Emoji Table-Top (GrinningFace) simulation for controlled studies of vision–semantic generalization. A robotic arm picks a cube and places it on the instructed emoji. The viewpoint is aligned with Bridge‑v2 to leverage this large-scale dataset for knowledge transfer. The initial positions of the cube, the emojis, and the robotic arm are randomized to test robustness. (b) Real‑World Tabletop Manipulation featuring diverse pick tasks across varied objects and backgrounds; evaluations cover in‑distribution scenes, OOD distractors, and OOD backgrounds to assess robustness.
  • Figure 5: Motion Transfer Demo
  • ...and 1 more figures
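Figure 2 describes scene-wise additivity $z_{ik}\approx z_{ij}+z_{jk}$ as a regularizer on top of a standard IDM–FDM architecture. A minimal NumPy sketch of such a regularizer is below; the function name `additivity_loss` is hypothetical, and in AC-LAM the latents $z_{ij}=f(o_i,o_j)$ would come from a learned IDM encoder rather than being supplied directly:

```python
import numpy as np

def additivity_loss(z_ij, z_ik, z_jk):
    """Hypothetical AC regularizer: penalize deviation from z_ik ~= z_ij + z_jk.

    In AC-LAM, z_ij = f(o_i, o_j) would be produced by a learned IDM encoder
    on observation pairs from the same scene; here the latents are given
    directly for illustration.
    """
    residual = z_ik - (z_ij + z_jk)
    return float(np.sum(residual ** 2))

# Toy check: latents defined as true state displacements compose additively,
# so the regularizer vanishes on them.
s_i = np.array([0.0, 0.0])
s_j = np.array([1.0, 0.5])
s_k = np.array([2.0, 2.0])
z_ij, z_jk, z_ik = s_j - s_i, s_k - s_j, s_k - s_i
print(additivity_loss(z_ij, z_ik, z_jk))  # 0.0
```

Intuitively, minimizing this residual over many same-scene triples is what suppresses non-composing information (scene appearance, future leakage) from the latent action.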

Theorems & Definitions (11)

  • Definition 3.1: Scene
  • Proposition 3.2: Identity
  • Proof
  • Proposition 3.3: Inverse consistency
  • Proof
  • Proposition 3.4: Cycle consistency
  • Proof
  • Proposition 3.5: No scene-related bias
  • Proof
  • Proposition 3.6: No future leakage
  • ...and 1 more