HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies

Zhiying Du, Bei Liu, Yaobo Liang, Yichao Shen, Haidong Cao, Xiangyu Zheng, Zhiyuan Feng, Zuxuan Wu, Jiaolong Yang, Yu-Gang Jiang

TL;DR

HiMoE-VLA tackles heterogeneous robotic data by introducing a Hierarchical Mixture-of-Experts action module that separates action-space variation from broader embodiment and sensor heterogeneity. It pairs a pretrained vision-language backbone with AS-MoE and HB-MoE layers, guided by flow-matching and regularizations to enable cross-domain generalization. Pretraining on large Open X-Embodiment and Aloha datasets, followed by fine-tuning on CALVIN and LIBERO and testing on real robots, demonstrates state-of-the-art performance and robust adaptation to unseen objects and environments. The work highlights effective knowledge transfer across diverse embodiments and action spaces, advancing robust embodied AI systems.

Abstract

The development of foundation models for embodied intelligence critically depends on access to large-scale, high-quality robot demonstration data. Recent approaches have sought to address this challenge by training on large collections of heterogeneous robotic datasets. However, unlike vision or language data, robotic demonstrations exhibit substantial heterogeneity across embodiments and action spaces, as well as other prominent variations such as sensor configurations and action control frequencies. The lack of explicit designs for handling such heterogeneity causes existing methods to struggle with integrating diverse factors, thereby limiting their generalization and leading to degraded performance when transferred to new settings. In this paper, we present HiMoE-VLA, a novel vision-language-action (VLA) framework tailored to effectively handle diverse, heterogeneous robotic data. Specifically, we introduce a Hierarchical Mixture-of-Experts (HiMoE) architecture for the action module that adaptively handles multiple sources of heterogeneity across layers and gradually abstracts them into shared knowledge representations. Through extensive experimentation with simulation benchmarks and real-world robotic platforms, HiMoE-VLA demonstrates a consistent performance boost over existing VLA baselines, achieving higher accuracy and robust generalization across diverse robots and action spaces. The code and models are publicly available at https://github.com/ZhiyingDu/HiMoE-VLA.

Paper Structure

This paper contains 34 sections, 10 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Overview of HiMoE-VLA. The left blue part illustrates the VLM backbone initialized from PaliGemma (Beyer et al., 2024), and the right orange part depicts our proposed action module with a novel Hierarchical Mixture-of-Experts (HiMoE), which is responsible for processing different robot states and noisy actions and generating final action outputs.
  • Figure 2: Detailed structure of the Hierarchical Mixture-of-Experts (HiMoE). The architecture follows a layered hierarchy: AS-MoE modules at the boundaries specialize in action-space variations, adjacent HB-MoE modules address broader heterogeneity, and the central Transformer blocks serve as shared layers for cross-domain knowledge integration.
  • Figure 3: Qualitative examples of real-world executions on (left) the single-arm xArm7 and (right) the dual-arm Aloha robot. The snapshots cover representative stages across tasks such as Fruit-to-Plate, Block-on-Block, Cup-Handover, and Scoop.
  • Figure 4: Expert Activation Heatmap of AS-MoE.
  • Figure 5: Expert Activation Heatmap of HB-MoE.
  • ...and 2 more figures
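
The layered hierarchy described in the Figure 2 caption (AS-MoE at the boundaries, HB-MoE next to them, shared Transformer blocks in the center) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `himoe_layout`, the parameter names, and the default layer counts are all hypothetical and chosen only to show the ordering.

```python
from typing import List


def himoe_layout(n_as: int = 1, n_hb: int = 1, n_shared: int = 2) -> List[str]:
    """Return layer types ordered from input to output.

    Hypothetical helper: the layer counts are assumptions, not values
    taken from the paper. Only the ordering follows the Figure 2 caption.
    """
    if min(n_as, n_hb, n_shared) < 0:
        raise ValueError("layer counts must be non-negative")
    outer = ["AS-MoE"] * n_as            # boundary layers: action-space variation
    inner = ["HB-MoE"] * n_hb            # adjacent layers: broader heterogeneity
    shared = ["Transformer"] * n_shared  # central layers: shared knowledge
    # Mirror the outer and inner groups around the shared center.
    return outer + inner + shared + inner + outer


layout = himoe_layout()
print(layout)
```

With these assumed counts, the layout is symmetric: specialized layers sit closest to the inputs and outputs of the action module, and the shared blocks sit in the middle.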