MoE-DP: An MoE-Enhanced Diffusion Policy for Robust Long-Horizon Robotic Manipulation with Skill Decomposition and Failure Recovery
Baiye Cheng, Tianhai Liang, Suning Huang, Maanping Shao, Feihong Zhang, Botian Xu, Zhengrong Xue, Huazhe Xu
TL;DR
Robotic diffusion policies excel at short-horizon visuomotor control but struggle with robustness and interpretability in long-horizon tasks. We introduce MoE-DP, which inserts a Mixture of Experts (MoE) layer between the visual encoder and the diffusion model to create specialized, interpretable skills and enable recovery from subtask failures. The training objective combines the standard diffusion loss with auxiliary load-balancing and entropy terms to promote balanced, specialized expert usage, yielding a clear mapping from experts to task primitives. On six long-horizon simulated tasks, MoE-DP achieves a 36% average relative improvement in success rate under disturbances, and real-world robot experiments confirm the robustness gains. It also supports inference-time control by rearranging expert activations without retraining.
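The combined objective described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the summary does not give the exact form of the auxiliary terms, so the load-balancing term here assumes a common Switch-Transformer-style formulation, the entropy term assumes the mean per-sample entropy of the router distribution, and the function names and the weights `w_balance` and `w_entropy` are hypothetical.

```python
import math


def load_balance_loss(probs):
    """Assumed Switch-style balance term: n_experts * sum_e(frac_e * mean_prob_e).

    probs is a list of per-sample router distributions (each sums to 1).
    The value is smallest when samples are spread evenly across experts.
    """
    n_samples = len(probs)
    n_experts = len(probs[0])
    frac = [0.0] * n_experts       # fraction of samples whose top-1 expert is e
    mean_prob = [0.0] * n_experts  # mean router probability assigned to e
    for p in probs:
        top = max(range(n_experts), key=lambda e: p[e])
        frac[top] += 1.0 / n_samples
        for e in range(n_experts):
            mean_prob[e] += p[e] / n_samples
    return n_experts * sum(f * m for f, m in zip(frac, mean_prob))


def routing_entropy(probs, eps=1e-12):
    """Mean per-sample entropy of the router output.

    Minimizing it pushes each sample toward a confident choice of expert,
    which is one way to encourage specialization (an assumption here).
    """
    total = 0.0
    for p in probs:
        total += -sum(q * math.log(q + eps) for q in p)
    return total / len(probs)


def total_loss(diffusion_loss, probs, w_balance=0.01, w_entropy=0.01):
    """Diffusion loss plus weighted auxiliary terms (weights are hypothetical)."""
    return (diffusion_loss
            + w_balance * load_balance_loss(probs)
            + w_entropy * routing_entropy(probs))
```

Under these assumptions the two auxiliary terms pull in complementary directions: the balance term penalizes collapsing all samples onto one expert, while the entropy term penalizes indecisive routing for any single sample.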
Abstract
Diffusion policies have emerged as a powerful framework for robotic visuomotor control, yet they often lack the robustness to recover from subtask failures in long-horizon, multi-stage tasks, and their learned representations of observations are often difficult to interpret. In this work, we propose the Mixture of Experts-Enhanced Diffusion Policy (MoE-DP), whose core idea is to insert a Mixture of Experts (MoE) layer between the visual encoder and the diffusion model. This layer decomposes the policy's knowledge into a set of specialized experts, which are dynamically activated to handle different phases of a task. We demonstrate through extensive experiments that MoE-DP exhibits a strong capability to recover from disturbances, significantly outperforming standard baselines in robustness. On a suite of 6 long-horizon simulation tasks, this leads to a 36% average relative improvement in success rate under disturbed conditions. This enhanced robustness is further validated in the real world, where MoE-DP also shows significant performance gains. We further show that MoE-DP learns an interpretable skill decomposition, where distinct experts correspond to semantic task primitives (e.g., approaching, grasping). This learned structure can be leveraged for inference-time control, allowing for the rearrangement of subtasks without any re-training. Our video and code are available at https://moe-dp-website.github.io/MoE-DP-Website/.
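To make the architecture concrete, the following minimal sketch shows an MoE layer that sits between visual features and the diffusion model's conditioning input, together with a manual expert override in the spirit of the inference-time control mentioned above. Everything specific here is an assumption for illustration: top-1 routing, linear experts, the `forced_expert` argument, and all names and toy weights are hypothetical and are not taken from the paper or its code.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def moe_layer(features, router, experts, forced_expert=None):
    """Route visual features to one expert and return its output.

    router:  one weight row per expert (a linear router, assumed).
    experts: one weight matrix per expert (linear experts, assumed).
    forced_expert: if set, bypasses the router's choice; a hypothetical
                   hook for choosing the active skill at inference time.
    Returns (conditioning vector, chosen expert index, router probabilities).
    """
    probs = softmax(matvec(router, features))
    if forced_expert is None:
        chosen = max(range(len(experts)), key=lambda e: probs[e])  # top-1 routing
    else:
        chosen = forced_expert
    return matvec(experts[chosen], features), chosen, probs


# Toy example: two experts over 2-D features.
router = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity map
    [[2.0, 0.0], [0.0, 2.0]],  # expert 1: scales features by 2
]
features = [0.2, 0.9]

cond, chosen, probs = moe_layer(features, router, experts)
forced_cond, forced_chosen, _ = moe_layer(features, router, experts, forced_expert=0)
```

In this sketch the returned conditioning vector would be passed to the diffusion model in place of the raw visual features, and the sequence of chosen expert indices over an episode is what one would inspect to see whether experts align with task phases.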
