MoE-ACT: Scaling Multi-Task Bimanual Manipulation with Sparse Language-Conditioned Mixture-of-Experts Transformers

Kangjun Guo; Haichao Liu; Yanji Sun; Ruhan Zhao; Jinni Zhou; Jun Ma

Abstract

The ability of robots to handle multiple tasks under a unified policy is critical for deploying embodied intelligence in real-world household and industrial applications. However, out-of-distribution variation across tasks often causes severe task interference and negative transfer when training general robotic policies. To address this challenge, we propose a lightweight multi-task imitation learning framework for bimanual manipulation, termed Mixture-of-Experts-Enhanced Action Chunking Transformer (MoE-ACT), which integrates sparse Mixture-of-Experts (MoE) modules into the Transformer encoder of ACT. The MoE layer decomposes a unified task policy into independently invoked expert components. Through adaptive activation, it naturally decouples multi-task action distributions in latent space. During decoding, Feature-wise Linear Modulation (FiLM) dynamically modulates action tokens to improve consistency between action generation and task instructions. In parallel, multi-scale cross-attention enables the policy to simultaneously focus on both low-level and high-level semantic features, providing rich visual information for robotic manipulation. We further incorporate textual information, transitioning the framework from a purely vision-based model to a vision-centric, language-conditioned action generation system. Experimental validation in both simulation and a real-world dual-arm setup shows that MoE-ACT substantially improves multi-task performance. Specifically, MoE-ACT outperforms vanilla ACT by an average of 33% in success rate. These results indicate that MoE-ACT provides stronger robustness and generalization in complex multi-task bimanual manipulation environments. Our open-source project page can be found at https://j3k7.github.io/MoE-ACT/.
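The two conditioning mechanisms named in the abstract, sparse expert routing in the encoder and FiLM modulation in the decoder, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes standard top-k softmax gating and the usual FiLM form (a per-feature scale and shift predicted from a task embedding). Each expert is reduced to a single linear map for brevity, and all names (`top_k_route`, `moe_layer`, `film`, `W_gate`, and so on) are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, W, b):
    """Affine map y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def top_k_route(token, W_gate, k):
    """Score every expert for one token and keep the k best.

    Assumption: gating is a bias-free linear layer followed by top-k
    selection, with softmax renormalised over the selected experts only.
    """
    logits = [sum(w * t for w, t in zip(row, token)) for row in W_gate]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return top, weights


def moe_layer(token, W_gate, experts, k=2):
    """Sparse MoE forward pass for a single token.

    `experts` is a list of (W, b) pairs; only the k routed experts are
    evaluated, which is what makes the layer sparse.
    """
    top, weights = top_k_route(token, W_gate, k)
    out = [0.0] * len(experts[top[0]][1])
    for idx, weight in zip(top, weights):
        W, b = experts[idx]
        expert_out = linear(token, W, b)
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out, top


def film(action_token, task_embedding, W_gamma, b_gamma, W_beta, b_beta):
    """FiLM: scale and shift each feature of an action token.

    gamma and beta are predicted from the task (language) embedding,
    so the same token is modulated differently for different instructions.
    """
    gamma = linear(task_embedding, W_gamma, b_gamma)
    beta = linear(task_embedding, W_beta, b_beta)
    return [g * a + b for g, a, b in zip(gamma, action_token, beta)]
```

In this sketch, `moe_layer` would be applied to each encoder token and `film` to each action token in the decoder. In a trained model the gate, expert, and FiLM weights are learned jointly, and the experts would typically be feed-forward blocks rather than single linear maps.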

Paper Structure (25 sections, 10 equations, 5 figures, 3 tables)

Introduction
Related Works
Imitation Learning for Robotic Manipulation
Multi-task Learning in Robotics
Methodology
Problem Formulation and Overview
MoE-Enhanced Encoder
Task-Conditioned Transformer Decoder
Task-conditioned FiLM
Multi-Scale Cross-Attention Layer
Training Objectives
CVAE Loss
MoE Auxiliary Loss
Experimental Results and Analysis
Simulation Results
...and 10 more sections

Figures (5)

Figure 1: Overview of MoE-ACT. The architecture consists of the MoE module integrated into the Transformer encoder and a FiLM mechanism in the decoder. The MoE module enables task-specific feature decoupling, while FiLM ensures that action generation is consistent with task instructions. Multi-scale cross-attention allows the model to capture both high-level semantics and low-level visual details for manipulation control.
Figure 2: Multi-task learning experiments in RoboTwin 2.0. MoE-ACT demonstrates superior performance across all six tasks, significantly outperforming the original ACT and other baselines. The line chart illustrates the evolution of selection weights for different experts over time.
Figure 3: Attention heatmap of MoE-ACT on RoboTwin 2.0. The top row displays attention on intermediate-level visual features, while the bottom row shows attention on final-level contextualized features across five decoder layers (L1 to L5). The color intensity corresponds to the attention magnitude, where brighter regions indicate higher values.
Figure 4: Real-world task setup. (a) shows the dual-arm handover task, while (b) illustrates the task of putting cubes into a box.
Figure 5: Real-world task definitions. (a) shows the "Putting cubes into a box" task, which requires the robot to pick up cubes and place them into a box. (b) shows the "Handover bottle" task, where the robot must grasp a bottle and hand it to a human. These tasks evaluate the multi-task learning capability of MoE-ACT in real-world bimanual manipulation scenarios.
