LDA-1B: Scaling Latent Dynamics Action Model via Universal Embodied Data Ingestion

Jiangran Lyu, Kai Liu, Xuheng Zhang, Haoran Liao, Yusen Feng, Wenxuan Zhu, Tingrui Shen, Jiayi Chen, Jiazhao Zhang, Yifei Dong, Wenbo Cui, Senmao Qi, Shuo Wang, Yixin Zheng, Mi Yan, Xuesong Shi, Haoran Li, Dongbin Zhao, Ming-Yu Liu, Zhizheng Zhang, Li Yi, Yizhou Wang, He Wang

TL;DR

LDA-1B addresses the scalability bottleneck of robot foundation models by coupling universal embodied data ingestion with a semantically structured latent space and a multi-modal diffusion transformer. It jointly learns policy, dynamics, and visual forecasting using EI-30K, a large, standardized dataset that spans real and simulated robots and humans, with role-aware supervision for data quality. The model demonstrates strong generalization and data efficiency across simulation and real-world tasks, including contact-rich and dexterous manipulation, and shows clear benefits from latent dynamics over pixel-based representations. This approach offers a practical pathway to scalable, dynamics-aware robot pretraining with mixed-quality data and cross-embodiment transfer, potentially accelerating real-world robotic capabilities.

Abstract

Recent robot foundation models largely rely on large-scale behavior cloning, which imitates expert actions but discards transferable dynamics knowledge embedded in heterogeneous embodied data. While the Unified World Model (UWM) formulation has the potential to leverage such diverse data, existing instantiations struggle to scale to the foundation-model level because of coarse data usage and fragmented datasets. We introduce LDA-1B, a robot foundation model that scales through universal embodied data ingestion by jointly learning dynamics, policy, and visual forecasting, assigning distinct roles to data of varying quality. To support this regime at scale, we assemble and standardize EI-30k, an embodied interaction dataset comprising over 30k hours of human and robot trajectories in a unified format. Scalable dynamics learning over such heterogeneous data is enabled by prediction in a structured DINO latent space, which avoids redundant pixel-space appearance modeling. Complementing this representation, LDA-1B employs a multi-modal diffusion transformer to handle asynchronous vision and action streams, enabling stable training at the 1B-parameter scale. Experiments in simulation and the real world show that LDA-1B outperforms prior methods (e.g., $\pi_{0.5}$) by up to 21%, 48%, and 23% on contact-rich, dexterous, and long-horizon tasks, respectively. Notably, LDA-1B enables data-efficient fine-tuning, gaining 10% by leveraging 30% low-quality trajectories that are typically harmful and discarded.
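
The idea of "assigning distinct roles to data of varying quality" can be pictured as a lookup from a data source's role to the training objectives it supervises. The sketch below is only an illustration under stated assumptions: the four objective names come from the paper's architecture description (policy learning, forward dynamics, inverse dynamics, visual forecasting), but the role names (`DataRole`), the specific role-to-objective table, and the helper functions are hypothetical and are not taken from the paper.

```python
from enum import Enum


class DataRole(Enum):
    """Hypothetical data-quality roles (names are assumptions, not from the paper)."""
    HIGH_QUALITY = "high_quality"  # clean, action-labeled demonstrations
    LOW_QUALITY = "low_quality"    # noisy or suboptimal action-labeled trajectories
    ACTIONLESS = "actionless"      # video without action labels


# The four co-training objectives named in the paper's architecture description.
OBJECTIVES = ("policy", "forward_dynamics", "inverse_dynamics", "visual_forecasting")

# ASSUMED mapping for illustration only: the paper states that noisy data and
# actionless videos contribute priors for dynamics learning, but the exact
# assignment used by LDA-1B is not given in this summary.
ROLE_TO_OBJECTIVES = {
    DataRole.HIGH_QUALITY: set(OBJECTIVES),
    DataRole.LOW_QUALITY: {"forward_dynamics", "inverse_dynamics", "visual_forecasting"},
    DataRole.ACTIONLESS: {"visual_forecasting"},
}


def objectives_for(role: DataRole) -> list[str]:
    """Return the sorted list of objectives that a sample with this role supervises."""
    return sorted(ROLE_TO_OBJECTIVES[role])


def loss_mask(role: DataRole) -> dict[str, bool]:
    """Return a per-objective on/off mask, e.g. to gate loss terms during training."""
    active = ROLE_TO_OBJECTIVES[role]
    return {name: name in active for name in OBJECTIVES}


if __name__ == "__main__":
    for role in DataRole:
        print(role.value, "->", objectives_for(role))
```

Under this assumed table, low-quality trajectories never supervise the policy objective, so their suboptimal actions are not imitated, while their state transitions can still inform the dynamics objectives.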

Paper Structure

This paper contains 36 sections, 3 equations, 16 figures, and 9 tables.

Figures (16)

  • Figure 1: We introduce LDA-1B, a 1.6B-parameter robot foundation model scaled on over 30k hours of heterogeneous embodied data. LDA-1B unifies policy, dynamics, and visual forecasting in a structured DINO [simeoni2025dinov3] latent space, allowing different data sources to play complementary roles. Beyond high-quality data alone, noisy data and actionless videos also provide valuable visual and physical priors for dynamics learning. This universal data ingestion paradigm enables stable scaling with data and model size, significantly outperforming strong baselines such as $\pi_{0.5}$ [intelligence2025pi_] across diverse manipulation tasks.
  • Figure 2: Architecture of LDA. LDA jointly denoises action chunks and future visual latents under multiple co-training objectives, including policy learning, forward dynamics, inverse dynamics, and visual forecasting. Conditioned on VLM tokens, diffusion timesteps, and task embeddings, the model adopts a multimodal diffusion transformer architecture in which the action and visual experts are decoupled and interact through a shared self-attention layer.
  • Figure 3: Aligned End Effector Coordinate Systems. We manually align coordinate frames across diverse robot and human embodiments to ensure consistency. This shared representation enables joint learning from heterogeneous interaction data.
  • Figure 4: Statistics of EI-30K. The dataset contains more than 30k hours of diverse human and robot interaction data (right). It spans varying episode lengths (left) and a rich set of manipulation tasks (center).
  • Figure 5: Real-World Manipulation Demonstrations Across Multiple Robotic Platforms and End-Effectors. Galbot G1 equipped with a Sharpa dexterous hand (top-left), Unitree G1 with a BrainCo dexterous hand (middle and bottom-left), and Galbot G1 with a two-finger gripper (right).
  • ...and 11 more figures