Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild

Hao Luo, Ye Wang, Wanpeng Zhang, Haoqi Yuan, Yicheng Feng, Haiweng Xu, Sipeng Zheng, Zongqing Lu

TL;DR

We present JALA, a pretraining framework that learns Jointly-Aligned Latent Actions: a transition-aware, behavior-centric latent space for learning from heterogeneous human data. JALA generates more realistic hand motions in both controlled and unconstrained scenarios.

Abstract

Despite progress, Vision-Language-Action models (VLAs) are limited by a scarcity of large-scale, diverse robot data. While human manipulation videos offer a rich alternative, existing methods are forced to choose between small, precisely labeled datasets and vast in-the-wild footage with unreliable hand-tracking labels. We present JALA, a pretraining framework that learns Jointly-Aligned Latent Actions. JALA bypasses full visual dynamics reconstruction and instead learns a predictive action embedding aligned with both inverse dynamics and real actions. This yields a transition-aware, behavior-centric latent space for learning from heterogeneous human data. We scale this approach with UniHand-Mix, a 7.5M video corpus (>2,000 hours) blending laboratory and in-the-wild footage. Experiments demonstrate that JALA generates more realistic hand motions in both controlled and unconstrained scenarios, significantly improving downstream robot manipulation performance in both simulation and real-world tasks. These results indicate that jointly-aligned latent actions offer a scalable pathway for VLA pretraining from human data.
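The joint alignment described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the loss forms, so the cosine alignment term, the mean-squared-error action term, and the names `cosine_alignment_loss`, `masked_action_loss`, `joint_loss`, and `align_weight` are all assumptions chosen for illustration. The sketch only shows the general idea of aligning a predictive embedding `h` with a latent action `z` for every sample, while supervising predicted actions only where labels exist.

```python
import numpy as np

def cosine_alignment_loss(h, z):
    """Mean (1 - cosine similarity) between matching rows of h and z.

    Assumption: cosine distance stands in for whatever alignment
    objective the paper actually uses.
    """
    h_n = h / (np.linalg.norm(h, axis=-1, keepdims=True) + 1e-8)
    z_n = z / (np.linalg.norm(z, axis=-1, keepdims=True) + 1e-8)
    return float(np.mean(1.0 - np.sum(h_n * z_n, axis=-1)))

def masked_action_loss(pred, target, has_label):
    """MSE computed only over samples that carry an action label.

    Assumption: MSE is a placeholder for the real action loss.
    Video-only samples (has_label == False) contribute nothing.
    """
    if not has_label.any():
        return 0.0
    diff = pred[has_label] - target[has_label]
    return float(np.mean(diff ** 2))

def joint_loss(h, z, pred, target, has_label, align_weight=1.0):
    """Hypothetical combined objective: action supervision plus alignment."""
    return (masked_action_loss(pred, target, has_label)
            + align_weight * cosine_alignment_loss(h, z))
```

Under these assumptions, a batch mixing labeled laboratory clips with unlabeled in-the-wild clips still receives a training signal for every sample through the alignment term, which is the property the abstract attributes to the approach.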

Paper Structure

This paper contains 40 sections, 7 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Comparison of VLA's latent action paradigms with human videos. (left) Prior reconstruction-based methods like LAPA [ye2024lapa] rely on multi-stage pipelines extracting latent actions via dynamics reconstruction as pseudo-labels. (middle) Our JALA introduces predictive embeddings aligned with latent actions. (right) Transformer-based JALA implementation where intermediate hidden states serve as the predictive embeddings to align with latent actions, while output tokens use available action labels as supervision.
  • Figure 2: The JALA framework. Pre-training (left): Hidden states of masked motion chunks serve as predictive embeddings to align with latent actions from boundary frames. The Latent Action Perceiver (LAP) maps boundary frames to latent action space, providing supervision without action labels. A parameter-shared Latent State Perceiver (LSP) injects initial frame context, with LAP and LSP linked via decoupled EMA update for stability. Post-training (right): The predictive embeddings are fed into a flow-matching head for robot task transfer.
  • Figure 3: Dataset statistics of UniHand-Mix. Top-left: distribution of data types (motion generation, video-only, motion description, and motion continuation). Bottom-left: distribution of clip lengths (1–10 seconds). Center: data source distribution across 8 data sources with a donut percentage chart on the right.
  • Figure 4: t-SNE of predictive embeddings $h$ and latent actions $z$ across Lab and Wild. The two spaces cluster in closely aligned regions, and Wild samples largely expand the Lab manifold.
  • Figure 5: Qualitative hand-motion generation on lab (left column) and wild (right column) scenes. Colored overlays denote generated hand poses.
  • ...and 6 more figures
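The Figure 2 caption mentions that the Latent Action Perceiver and Latent State Perceiver are linked via an EMA update for stability. The caption does not say what "decoupled" means in this context, so the sketch below shows only a generic exponential-moving-average parameter update. The function name `ema_update`, the dictionary-of-arrays parameter layout, and the `decay` value are assumptions for illustration, not details from the paper.

```python
import numpy as np

def ema_update(target_params, online_params, decay=0.99):
    """Generic EMA step: new_target = decay * target + (1 - decay) * online.

    Assumption: parameters are stored as a dict of name -> ndarray.
    Returns a new dict and leaves both inputs unmodified.
    """
    return {
        name: decay * target_params[name] + (1.0 - decay) * online_params[name]
        for name in target_params
    }
```

A higher `decay` makes the averaged copy change more slowly, which is the usual reason such updates are used to stabilise a module that provides training targets.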