LAP: Language-Action Pre-Training Enables Zero-shot Cross-Embodiment Transfer

Lihan Zha, Asher J. Hancock, Mingtong Zhang, Tenny Yin, Yixuan Huang, Dhruv Shah, Allen Z. Ren, Anirudha Majumdar

TL;DR

This work tackles zero-shot cross-embodiment transfer for robotic policies by introducing Language-Action Pre-training (LAP), which represents motor actions as natural language to align supervision with pre-trained vision–language models. The resulting LAP-3B system combines a LAP-trained VLM backbone with a lightweight diffusion-based action expert, enabling real-time control and substantial zero-shot generalization to unseen robot embodiments across multiple tasks. Empirical results show over 50% average zero-shot success on novel robots, roughly a 2× improvement over prior VLAs, and improved data and compute efficiency when fine-tuning to new embodiments. The approach also benefits from co-training with VQA tasks and scales favorably with model size, suggesting a practical path toward broad, embodied robotics capabilities without per-embodiment fine-tuning.

Abstract

A long-standing goal in robotics is a generalist policy that can be deployed zero-shot on new robot embodiments without per-embodiment adaptation. Despite large-scale multi-embodiment pre-training, existing Vision-Language-Action models (VLAs) remain tightly coupled to their training embodiments and typically require costly fine-tuning. We introduce Language-Action Pre-training (LAP), a simple recipe that represents low-level robot actions directly in natural language, aligning action supervision with the pre-trained vision-language model's input-output distribution. LAP requires no learned tokenizer, no costly annotation, and no embodiment-specific architectural design. Based on LAP, we present LAP-3B, which to the best of our knowledge is the first VLA to achieve substantial zero-shot transfer to previously unseen robot embodiments without any embodiment-specific fine-tuning. Across multiple novel robots and manipulation tasks, LAP-3B attains over 50% average zero-shot success, delivering roughly a 2x improvement over the strongest prior VLAs. We further show that LAP enables efficient adaptation and favorable scaling, while unifying action prediction and VQA in a shared language-action format that yields additional gains through co-training.
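
To make the core idea concrete, the following is a minimal sketch of how a low-level end-effector command might be rendered as a natural-language string for use as a supervision target. The text above does not specify the phrasing the authors use, so everything here is an assumption for illustration: the `EndEffectorDelta` type, the axis convention, the centimetre rounding, the gripper threshold, and the sentence template are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EndEffectorDelta:
    """Hypothetical low-level action: a translation in metres plus a gripper command."""
    dx: float       # +x = forward (assumed axis convention)
    dy: float       # +y = left (assumed)
    dz: float       # +z = up (assumed)
    gripper: float  # values above 0.5 mean "close" (assumed threshold)


def describe_axis(value_m, positive, negative, min_cm=1):
    """Render one axis as text, or return None if the motion rounds below min_cm."""
    cm = round(abs(value_m) * 100)
    if cm < min_cm:
        return None
    direction = positive if value_m > 0 else negative
    return f"move {direction} {cm} cm"


def action_to_language(action):
    """Turn a numeric action into a plain-text description (illustrative template only)."""
    parts = [
        describe_axis(action.dx, "forward", "backward"),
        describe_axis(action.dy, "left", "right"),
        describe_axis(action.dz, "up", "down"),
    ]
    phrases = [p for p in parts if p is not None]
    phrases.append("close gripper" if action.gripper > 0.5 else "open gripper")
    return ", ".join(phrases)
```

With this template, `action_to_language(EndEffectorDelta(0.03, -0.012, 0.0, 1.0))` returns `"move forward 3 cm, move right 1 cm, close gripper"`. Because the target is ordinary text, it can be tokenized with the language model's existing vocabulary, which is the property the abstract highlights when it says no learned tokenizer is needed.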

Paper Structure

This paper contains 32 sections, 9 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: We introduce Language-Action Pre-training (LAP), a general VLA pre-training recipe that represents low-level actions directly in natural language to supervise a vision–language backbone, and instantiate it as LAP-3B, the first VLA to demonstrate strong zero-shot transfer to novel embodiments. Compared to state-of-the-art VLAs, LAP-3B learns more generalizable embodiment representations and exhibits favorable scaling behavior.
  • Figure 2: (a) A unified view comparing LAP-3B with prior VLAs in terms of action representation. A VLM backbone predicts discrete language-action tokens using a cross-entropy objective, while a lightweight action expert predicts continuous actions via flow matching. Gradients from the action expert are blocked from the VLM through knowledge insulation (Driess et al., 2025), ensuring that the VLM is trained purely via language supervision. At test time, the action expert is rolled out for fast inference. (b) Visualizations of language-actions from the DROID dataset (Khazatsky et al., 2025).
  • Figure 3: Zero-shot cross-embodiment generalization performance. LAP-3B achieves performance comparable to $\pi_{0.5}$-DROID on the seen embodiment. Across three previously unseen embodiments and six real-world manipulation tasks, LAP-3B attains over 50% average zero-shot success, approximately a $2\times$ improvement over the strongest baselines, while all open-sourced VLAs collapse to a zero success rate. Error bars denote 95% finite-sample–valid confidence intervals that control the Type-I error (miscoverage) probability, following Vincent et al. (2024).
  • Figure 4: Fine-tuning efficiency in simulation (LIBERO) and real-world manipulation tasks. Across both domains, LAP-3B adapts substantially faster than baseline policies, reaching high performance with significantly fewer epochs and demonstrations. In simulation, LAP-3B converges to near-optimal success within a fraction of the training steps required by baselines. On real robots, LAP-3B achieves comparable task performance using approximately $2.5\times$ fewer demonstrations, demonstrating substantially improved data and compute efficiency when transferring to new embodiments.
  • Figure 5: (a) t-SNE visualizations of learned embodiment representations for LAP-3B and $\pi_{0.5}$-replicated. LAP-3B exhibits substantial overlap between training and unseen embodiments, whereas $\pi_{0.5}$-replicated shows limited alignment, indicating that LAP-3B learns more transferable, embodiment-agnostic control representations. (b) Action prediction error on unseen embodiments during pre-training. LAP-3B achieves consistently lower action prediction error on held-out embodiments throughout training than the $\pi_{0.5}$-replicated and $\pi_{0}$-replicated baselines. This indicates that language-action supervision enables the model to learn control representations that generalize across embodiments, giving more accurate action prediction on novel robots as well as smoother training dynamics.
  • ...and 7 more figures
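
The Figure 3 caption above mentions 95% confidence intervals that remain valid at finite sample sizes, which matters when a success rate is estimated from a small number of robot trials. The construction the authors follow is not described on this page, so the sketch below substitutes the classical Clopper-Pearson interval, a different but well-known interval with a finite-sample coverage guarantee for a binomial success probability. The function names are illustrative, and the bisection solver is just one simple way to invert the binomial tail.

```python
from math import comb


def binom_cdf(k, n, p):
    """P[X <= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def bisect_decreasing(f, lo=0.0, hi=1.0, iters=60):
    """Approximate the root of a decreasing function f on [lo, hi] by bisection."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(successes, trials, alpha=0.05):
    """Two-sided Clopper-Pearson interval for a success probability.

    This is a stand-in example, not the interval construction used in the paper.
    """
    k, n = successes, trials
    # Lower bound: the p at which P[X >= k] equals alpha / 2 (0 if no successes).
    if k == 0:
        lower = 0.0
    else:
        lower = bisect_decreasing(
            lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p))
        )
    # Upper bound: the p at which P[X <= k] equals alpha / 2 (1 if all succeed).
    if k == n:
        upper = 1.0
    else:
        upper = bisect_decreasing(lambda p: binom_cdf(k, n, p) - alpha / 2)
    return lower, upper
```

For example, `clopper_pearson(5, 10)` gives an interval of roughly 0.19 to 0.81, which illustrates how wide the uncertainty is when a 50% success rate comes from only ten trials.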