Embodiment Transfer Learning for Vision-Language-Action Models

Chengmeng Li, Yaxin Peng

TL;DR

This work tackles the transfer of Vision-Language-Action (VLA) models to multi-robot settings, identifying two failure modes of autoregressive VLAs: incorrect action-token counts and poor planning. It introduces Synthetic Continued Pretraining (SCP), which synthesizes multi-robot data and enforces correct action-token counts, and Embodied Graph-of-Thought (EGoT), which encodes explicit task dependencies for coordinated execution. The resulting ET-VLA framework yields substantial gains over OpenVLA and diffusion-based baselines in both real bimanual-robot experiments and simulation benchmarks, with real-world average success rising from 6.49% (ablations) to 59.74%. With interpretable planning cues from EGoT and cost-effective pretraining from SCP, the approach offers a practical path to robust multi-robot VLA deployments.

Abstract

Vision-language-action (VLA) models have significantly advanced robotic learning, enabling training on large-scale, cross-embodiment data and fine-tuning for specific robots. However, state-of-the-art autoregressive VLAs struggle with multi-robot collaboration. We introduce embodiment transfer learning, denoted ET-VLA, a novel framework for efficient and effective transfer of pre-trained VLAs to multi-robot settings. The core of ET-VLA is Synthetic Continued Pretraining (SCP), which uses synthetically generated data to warm up the model for the new embodiment, bypassing the need for real human demonstrations and reducing data collection costs. SCP enables the model to learn correct actions and to emit the precise number of action tokens. Following SCP, the model is fine-tuned on target-embodiment data. To further improve multi-embodiment performance, we present the Embodied Graph-of-Thought (EGoT) technique, which formulates each sub-task as a node and thereby allows the VLA model to distinguish the functionalities and roles of each embodiment during task execution. To verify our approach, we consider bimanual robots, a simple instance of the multi-robot setting. We validate the effectiveness of our method on both simulation benchmarks and real robots covering three different bimanual embodiments. In particular, ET-VLA outperforms OpenVLA by over 53.2% on six real-world tasks. We will open-source all code to support the community in advancing VLA models for robot learning.
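The token-count requirement described in the abstract can be illustrated with a small sketch. This is not the authors' SCP pipeline: the 7 action dimensions per arm, the 256 discretization bins, the action range of [-1, 1], and every function name below are assumptions chosen only to make the idea concrete. The sketch synthesizes a bimanual sample by concatenating two randomly generated single-arm actions and checks that the token sequence has the length a two-arm embodiment would need.

```python
import random

# All constants are illustrative assumptions, not values from the paper.
DOF_PER_ARM = 7   # assumed: 6-DoF end-effector delta plus one gripper value
NUM_BINS = 256    # assumed: uniform discretization of each action dimension
NUM_ARMS = 2      # bimanual setting


def discretize(value, low=-1.0, high=1.0, bins=NUM_BINS):
    """Map a continuous value in [low, high] to an integer bin id."""
    clipped = min(max(value, low), high)
    idx = int((clipped - low) / (high - low) * bins)
    return min(idx, bins - 1)  # the upper bound falls into the last bin


def synth_single_arm_action(rng):
    """Draw a random single-arm action (hypothetical stand-in for synthetic data)."""
    return [rng.uniform(-1.0, 1.0) for _ in range(DOF_PER_ARM)]


def synth_bimanual_tokens(rng):
    """Concatenate two single-arm actions and discretize them into tokens."""
    left = synth_single_arm_action(rng)
    right = synth_single_arm_action(rng)
    return [discretize(v) for v in left + right]


def has_valid_token_count(tokens, num_arms=NUM_ARMS):
    """Check that a sequence has exactly one token per action dimension."""
    return len(tokens) == num_arms * DOF_PER_ARM


rng = random.Random(0)
sample = synth_bimanual_tokens(rng)
print(len(sample), has_valid_token_count(sample))
```

Under these assumptions, a model that was pre-trained on single-arm data and stops after 7 tokens would fail `has_valid_token_count` for a bimanual target, which is the kind of mismatch the warm-up stage is meant to remove.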

Paper Structure

This paper contains 13 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of ET-VLA. We propose Synthetic Continued Pretraining (SCP) and Embodied Graph-of-Thought (EGoT) as key components. (1) SCP uses synthetically generated data to warm up the model for the new embodiment, bypassing the need for real human demonstrations and reducing data collection costs. SCP enables the model to learn correct actions and to emit the precise number of action tokens. (2) EGoT formulates each sub-task as a node, which allows the VLA model to distinguish the functionalities and roles of each embodiment during task execution, promoting more effective coordination.
  • Figure 2: (a) Real robot and all objects used in our work. We use two UR5 robot arms equipped with Robotiq grippers and a diverse set of everyday objects in our manipulation tasks. A RealSense D457 camera mounted overhead captures visual observations. (b) Simulation. We conduct experiments on two simulation benchmarks, RLBench2 (Grotz et al., 2024) and RoboTwin (Mu et al., 2024). (c) Real robot tasks. We design six collaborative multi-robot tasks for our real-world experiments.
  • Figure 3: (a) Learning efficiency. We show the learning curves of ET-VLA and OpenVLA on six real-world tasks. ET-VLA converges rapidly to high accuracy. (b) Success rate over six tasks. We doubled the training data and the training duration for OpenVLA and refer to this result as OpenVLA (extra data). Even under these conditions, ET-VLA outperforms OpenVLA (extra data) by 9.1% with half the training time.
  • Figure 4: Example of Embodied Graph-of-Thought (EGoT). For readability, we show only a simplified version of the prompt and the task output.
  • Figure 5: EGoT is capable of handling complex task sequences. We deliberately remove the column from the gripper and place it back on the table. The robot then tries to pick it up again, and the model's output transitions to "pick up the pink column".
  • ...and 1 more figure
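The captions describe EGoT as a graph in which each sub-task is a node with explicit dependencies. A minimal sketch of that data structure follows, using Python's standard `graphlib` module. It is an illustration under stated assumptions, not the authors' prompt format or implementation: the `SubTask` class, the `execution_stages` helper, the arm assignments, and the sub-tasks "hold the base" and "place the column on the base" are hypothetical. Only "pick up the pink column" appears in the source.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class SubTask:
    """One node of a hypothetical task graph."""
    name: str
    arm: str                 # which embodiment executes this sub-task
    depends_on: tuple = ()   # names of sub-tasks that must finish first


def execution_stages(subtasks):
    """Group sub-tasks into stages; tasks in one stage have no unmet dependencies."""
    graph = {t.name: set(t.depends_on) for t in subtasks}
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are cyclic
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a deterministic order
        stages.append(ready)
        sorter.done(*ready)
    return stages


# Hypothetical bimanual plan; only the first sub-task name comes from the source.
plan = [
    SubTask("pick up the pink column", arm="left"),
    SubTask("hold the base", arm="right"),
    SubTask(
        "place the column on the base",
        arm="left",
        depends_on=("pick up the pink column", "hold the base"),
    ),
]

stages = execution_stages(plan)
for i, stage in enumerate(stages):
    print(i, stage)
```

In this sketch the two arms can work in parallel in the first stage, and the placement step waits for both. If a node is invalidated at run time, as when the column is removed from the gripper in Figure 5, re-running the scheduler from the affected node would again yield "pick up the pink column" as the next ready sub-task.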