Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization

Ye Wang; Sipeng Zheng; Hao Luo; Wanpeng Zhang; Haoqi Yuan; Chaoyi Xu; Haiweng Xu; Yicheng Feng; Mingyang Yu; Zhiyu Kang; Zongqing Lu; Qin Jin

TL;DR

This work probes whether the common scale-data paradigm for Vision-Language-Action models translates to robotics with heterogeneous embodiments. It introduces a Mixture-of-Transformers with Flow-Matching, a unified End-Effector relative action space, and a Grouped Blind Ensemble evaluation to study physical alignment, embodiment mixture, and training regularization. Key findings show that End-Effector relative actions reliably enable cross-embodiment transfer, naive mixing of heterogeneous robot data often hurts, and regularization techniques may not provide consistent benefits, while the bias-reducing evaluation protocol enhances real-world reliability. The study yields practical guidelines for training large-scale VLA policies from diverse robotic data and cautions against indiscriminate data pooling, contributing to more robust, transfer-friendly embodied AI systems.

Abstract

While Vision-Language-Action (VLA) models show strong promise for generalist robot control, it remains unclear whether -- and under what conditions -- the standard "scale data" recipe translates to robotics, where training data is inherently heterogeneous across embodiments, sensors, and action spaces. We present a systematic, controlled study of VLA scaling that revisits core training choices for pretraining across diverse robots. Using a representative VLA framework that combines a vision-language backbone with flow-matching, we ablate key design decisions under matched conditions and evaluate in extensive simulation and real-robot experiments. To improve the reliability of real-world results, we introduce a Grouped Blind Ensemble protocol that blinds operators to model identity and separates policy execution from outcome judgment, reducing experimenter bias. Our analysis targets three dimensions of VLA scaling. (1) Physical alignment: we show that a unified end-effector (EEF)-relative action representation is critical for robust cross-embodiment transfer. (2) Embodiment mixture: we find that naively pooling heterogeneous robot datasets often induces negative transfer rather than gains, underscoring the fragility of indiscriminate data scaling. (3) Training regularization: we observe that intuitive strategies, such as sensory dropout and multi-stage fine-tuning, do not consistently improve performance at scale. Together, this study challenges some common assumptions about embodied scaling and provides practical guidance for training large-scale VLA policies from diverse robotic data. Project website: https://research.beingbeyond.com/rethink_vla
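The blinding idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' Grouped Blind Ensemble implementation: the grouping and ensemble details are not given here, and the function names `make_blind_schedule` and `unblind`, the `policy_<i>` code format, and the seed parameter are all hypothetical. The sketch only shows the two properties the abstract states: operators see anonymous codes instead of model names, and outcome judgments are mapped back to models only after all trials are judged.

```python
import random


def make_blind_schedule(model_names, trials_per_model, seed=0):
    """Build an anonymized, shuffled trial list plus the secret code-to-model key.

    Hypothetical helper: the operator only ever receives `schedule`;
    `key` is held back by a third party until judging is finished.
    """
    rng = random.Random(seed)
    shuffled = list(model_names)
    rng.shuffle(shuffled)  # code numbers carry no hint of the original ordering
    key = {f"policy_{i}": name for i, name in enumerate(shuffled)}
    schedule = [code for code in key for _ in range(trials_per_model)]
    rng.shuffle(schedule)  # interleave trials so run order does not reveal identity
    return schedule, key


def unblind(judgments, key):
    """Turn (code, success) judgments into per-model success rates.

    `judgments` is produced by a judge who sees only the anonymous codes.
    """
    totals = {}
    for code, success in judgments:
        wins, count = totals.get(key[code], (0, 0))
        totals[key[code]] = (wins + int(success), count + 1)
    return {name: wins / count for name, (wins, count) in totals.items()}
```

In use, one person would run the policies in `schedule` order, a second person would record success or failure per anonymous code, and `unblind` would be called only once every trial has a judgment.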

Paper Structure (18 sections, 5 equations, 6 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Methodology and Evaluation Protocol
Mixture-of-Transformers with Flow-Matching Control
Grouped Blind Ensemble Evaluation
Pre-training Data and Implementation
Large-scale Heterogeneous Robot Data
Pretraining Implementation Details
Experiments
Exploration of Physical Alignment
Exploration of Embodiment Mixture
Exploration of Training Regularization
Comparison with Representative Generalist Policies
Qualitative Analysis
Conclusion
...and 3 more sections

Figures (6)

Figure 1: Overview of our systematic VLA analysis framework, comprising the Mixture-of-Transformers architecture, physically aligned action spaces, and the Grouped Blind Ensemble protocol.
Figure 2: Composition of the balanced pre-training data.
Figure 3: Real-world experimental setup.
Figure 4: Real-world blind evaluation of different action spaces.
Figure 5: Real-world blind evaluation of different pre-training data mixtures.
...and 1 more figure
