Embodiment Transfer Learning for Vision-Language-Action Models

Chengmeng Li; Yaxin Peng

TL;DR

This work addresses the transfer of Vision-Language-Action (VLA) models to multi-robot settings, where autoregressive VLAs fail in two ways: they emit the wrong number of action tokens and they plan poorly across robots. It introduces Synthetic Continued Pretraining (SCP), which synthesizes multi-robot data and teaches the model to produce the correct action-token count, and Embodied Graph-of-Thought (EGoT), which encodes explicit task dependencies for coordinated execution. The resulting ET-VLA framework improves substantially over OpenVLA and diffusion-based baselines in both real-world bimanual experiments and simulation benchmarks, with real-world average success rising from $6.49\%$ (ablations) to $59.74\%$. EGoT provides interpretable planning cues and SCP keeps pretraining costs low, which makes the approach a practical path to robust multi-robot VLA deployment.
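To make the token-count failure concrete, the following is a minimal sketch of the kind of check a multi-robot action decoder might apply. It is not the authors' implementation: the function names, the two-arm default, and the value of 7 action dimensions per arm (for example, a 6-DoF end-effector delta plus a gripper command) are assumptions introduced for illustration.

```python
from typing import List, Sequence


def has_valid_action_length(
    action_tokens: Sequence[int],
    num_arms: int = 2,       # assumption: bimanual setup
    dims_per_arm: int = 7,   # assumption: 6-DoF pose delta + gripper
) -> bool:
    """Return True when there is exactly one token per action dimension."""
    return len(action_tokens) == num_arms * dims_per_arm


def split_per_arm(
    action_tokens: Sequence[int],
    num_arms: int = 2,
    dims_per_arm: int = 7,
) -> List[List[int]]:
    """Split a flat token sequence into one chunk per arm.

    Raises ValueError if the sequence has the wrong length, which is the
    failure mode a single-arm model may show on a multi-arm embodiment.
    """
    if not has_valid_action_length(action_tokens, num_arms, dims_per_arm):
        raise ValueError(
            f"expected {num_arms * dims_per_arm} action tokens, "
            f"got {len(action_tokens)}"
        )
    return [
        list(action_tokens[i * dims_per_arm:(i + 1) * dims_per_arm])
        for i in range(num_arms)
    ]
```

Under these assumptions, a model that still emits a single-arm sequence would fail the length check before any action reaches the robots.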

Abstract

Vision-language-action (VLA) models have significantly advanced robot learning by enabling training on large-scale, cross-embodiment data and fine-tuning for specific robots. However, state-of-the-art autoregressive VLAs struggle with multi-robot collaboration. We introduce embodiment transfer learning, denoted ET-VLA, a novel framework for efficiently and effectively transferring pre-trained VLAs to multi-robot settings. The core of ET-VLA is Synthetic Continued Pretraining (SCP), which uses synthetically generated data to warm up the model for the new embodiment, bypassing the need for real human demonstrations and reducing data collection costs. SCP enables the model to learn correct actions and the precise number of action tokens. Following SCP, the model is fine-tuned on target-embodiment data. To further improve performance in multi-embodiment settings, we present Embodied Graph-of-Thought (EGoT), a technique that formulates each sub-task as a node, which allows the VLA model to distinguish the functionality and role of each embodiment during task execution. We verify our approach on bimanual robots, a simple instance of the multi-robot setting. We validate the method on both simulation benchmarks and real robots covering three different bimanual embodiments. In particular, ET-VLA outperforms OpenVLA on six real-world tasks by over 53.2%. We will open-source all code to support the community in advancing VLA models for robot learning.
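The idea of treating each sub-task as a node with explicit dependencies can be illustrated with a small sketch. This is not the paper's EGoT format: the `SubTask` class, the arm labels, and the example task are hypothetical, and the scheduling uses Python's standard `graphlib.TopologicalSorter` (Python 3.9+) only to show how a dependency graph separates sub-tasks that can run in parallel from those that must wait.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class SubTask:
    """One node of a hypothetical task graph."""
    name: str
    arm: str                          # which embodiment executes this node
    depends_on: Tuple[str, ...] = ()  # names of prerequisite sub-tasks


def parallel_stages(subtasks: Sequence[SubTask]) -> List[List[str]]:
    """Group sub-tasks into stages whose members have no unmet dependencies."""
    sorter = TopologicalSorter({t.name: set(t.depends_on) for t in subtasks})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable order
        stages.append(ready)
        sorter.done(*ready)
    return stages


# Hypothetical bimanual task: one arm steadies a bowl while the other stirs.
plan = [
    SubTask("hold_bowl", arm="left"),
    SubTask("pick_spoon", arm="right"),
    SubTask("stir", arm="right", depends_on=("hold_bowl", "pick_spoon")),
]
stages = parallel_stages(plan)
```

Here `hold_bowl` and `pick_spoon` land in the same stage because neither depends on the other, while `stir` waits for both. The `arm` field records which embodiment is responsible for each node.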
