X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, Ya-Qin Zhang, Jiangmiao Pang, Jingjing Liu, Tai Wang, Xianyuan Zhan

TL;DR

This work tackles the heterogeneity problem in cross-embodiment Vision-Language-Action learning by introducing X-VLA, a soft-prompted Transformer framework. It attaches embodiment-specific, learnable prompts to each data source, so that a shared backbone can learn an embodiment-agnostic policy while the prompts absorb hardware variations. The approach uses a flow-matching objective with a streamlined encoding pipeline that handles both high-dimensional vision–language inputs and low-dimensional proprioceptive signals, together with a two-stage adaptation procedure and enhanced data processing. Empirically, X-VLA-0.9B achieves state-of-the-art performance across six simulation benchmarks and three real robots. It also supports data- and parameter-efficient adaptation (e.g., 9M tunable parameters via LoRA) and high-throughput dexterous cloth folding, highlighting the method’s scalability and practical impact on generalist robotic intelligence.
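The flow-matching objective mentioned above can be illustrated with a minimal sketch. It assumes the common linear-interpolation formulation (noise at t = 0, the ground-truth action at t = 1, and a constant target velocity equal to their difference); the summary does not spell out the exact variant X-VLA uses, and the function names here are hypothetical.

```python
import random


def flow_matching_pair(action, rng):
    """Build one training example: interpolated point x_t, time t, target velocity.

    Assumption: linear path x_t = (1 - t) * noise + t * action, so the
    velocity the network should predict is (action - noise).
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()  # uniform in [0, 1)
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    target_velocity = [a - n for n, a in zip(noise, action)]
    return x_t, t, target_velocity


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    squared = [(p - v) ** 2 for p, v in zip(predicted_velocity, target_velocity)]
    return sum(squared) / len(squared)


rng = random.Random(0)
action = [0.5, -0.2, 1.0]  # toy 3-dimensional action, purely illustrative
x_t, t, target_velocity = flow_matching_pair(action, rng)

# A perfect prediction gives zero loss; moving from x_t along the target
# velocity for the remaining time (1 - t) lands on the original action.
perfect_loss = flow_matching_loss(target_velocity, target_velocity)
recovered = [x + (1.0 - t) * v for x, v in zip(x_t, target_velocity)]
```

In a real policy the predicted velocity would come from the Transformer conditioned on vision, language, and proprioception; here it is replaced by the target itself only to show that the loss and the path are consistent.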

Abstract

Successful generalist Vision-Language-Action (VLA) models rely on effective training across diverse robotic platforms with large-scale, cross-embodiment, heterogeneous datasets. To facilitate and leverage the heterogeneity in rich, diverse robotic data sources, we propose a novel Soft Prompt approach with minimally added parameters, by infusing prompt learning concepts into cross-embodiment robot learning and introducing separate sets of learnable embeddings for each distinct data source. These embeddings serve as embodiment-specific prompts, which together enable VLA models to exploit varying cross-embodiment features effectively. Our new X-VLA, a neat flow-matching-based VLA architecture, relies exclusively on soft-prompted standard Transformer encoders, enjoying both scalability and simplicity. Evaluated across 6 simulation benchmarks as well as 3 real-world robots, our 0.9B instantiation, X-VLA-0.9B, simultaneously achieves SOTA performance over a sweep of benchmarks, demonstrating superior results on a wide range of capabilities, from flexible dexterity to quick adaptation across embodiments, environments, and tasks. Website: https://thu-air-dream.github.io/X-VLA/
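The idea of "separate sets of learnable embeddings for each distinct data source" can be sketched as a lookup table keyed by data source. The sketch below is illustrative only: the class name, the source names, the prompt length, and the choice to prepend prompts to the token sequence are all assumptions rather than details taken from the paper, and plain Python lists stand in for trainable tensors.

```python
import random


class SoftPromptLibrary:
    """Hypothetical container holding one prompt matrix per data source."""

    def __init__(self, source_names, prompt_len, dim, seed=0):
        rng = random.Random(seed)
        # One independent (prompt_len x dim) matrix per source. In a real
        # model these would be trainable parameters updated by the optimizer.
        self.prompts = {
            name: [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                   for _ in range(prompt_len)]
            for name in source_names
        }

    def prepend(self, source_name, tokens):
        """Return the source's prompt tokens followed by the input tokens."""
        return self.prompts[source_name] + tokens


# Two made-up data sources sharing one backbone input format.
library = SoftPromptLibrary(["sim_arm", "real_bimanual"], prompt_len=4, dim=8)
tokens = [[0.0] * 8 for _ in range(10)]  # 10 placeholder input tokens

sequence = library.prepend("sim_arm", tokens)  # 4 prompt + 10 input tokens
```

Because only the small per-source prompt matrices differ, the shared Transformer sees the same sequence format for every embodiment, which is the property the abstract attributes to the approach.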

Paper Structure

This paper contains 31 sections, 1 equation, 16 figures, 16 tables.

Figures (16)

  • Figure 1: X-VLA employs distinctive learnable embeddings, referred to as soft prompts, to effectively address the heterogeneity present in cross-embodiment datasets. This approach, combined with stacking simple self-attention transformer blocks, provides a scalable solution for integrating diverse pretraining datasets and finetuning for a variety of domain-specific applications. Evaluated across 6 simulation benchmarks (including one autonomous driving benchmark) and 3 real-world robots, X-VLA achieves SOTA performance on most benchmark suites and real-world robotic tasks.
  • Figure 2: Comparison among four methods in handling heterogeneity in cross-embodiment training.
  • Figure 3: The recipe for mixed data used in pretraining experiments.
  • Figure 4: Comparison of backbone architectures on validation error. X-VLA achieves the lowest error while maintaining stable training on heterogeneous datasets.
  • Figure 5: With increased compute, data diversity, and data volume, X-VLA achieves lower validation prediction error, which can lead to improved adaptation performance, as discussed in Tab. \ref{tab:ablation}.
  • ...and 11 more figures