MoE-DP: An MoE-Enhanced Diffusion Policy for Robust Long-Horizon Robotic Manipulation with Skill Decomposition and Failure Recovery

Baiye Cheng, Tianhai Liang, Suning Huang, Maanping Shao, Feihong Zhang, Botian Xu, Zhengrong Xue, Huazhe Xu

TL;DR

Robotic diffusion policies excel at short-horizon visuomotor control but struggle with robustness and interpretability in long-horizon tasks. We introduce MoE-DP, which inserts a Mixture of Experts layer between the visual encoder and the diffusion model to create specialized, interpretable skills and enable recovery from subtask failures. The training objective combines the standard diffusion loss with auxiliary load-balancing and entropy terms to promote balanced, specialized expert usage, yielding a clear mapping from experts to task primitives. Across six long-horizon simulated tasks, MoE-DP achieves a 36% average relative improvement in success rate under disturbances, and real-world robot experiments show similar robustness gains. It also supports inference-time control by rearranging expert activations without retraining. This approach enhances robustness, interpretability, and flexible control for long-horizon robotic manipulation.
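To make the shape of such an objective concrete, here is a minimal sketch of how a diffusion loss might be combined with load-balancing and entropy terms computed from router logits. The specific forms are assumptions, not taken from the paper: the load-balancing term follows the Switch-Transformer-style product of routing fractions and mean router probabilities, the entropy term is the mean per-sample entropy of the router distribution, and the names `router_aux_terms`, `total_loss`, `lambda_balance`, and `lambda_entropy` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def router_aux_terms(batch_logits):
    """Return (load_balance, mean_entropy) for a batch of router logits.

    Assumed forms (not confirmed by the source):
      load_balance = N * sum_i f_i * P_i, where f_i is the fraction of
      samples whose top-1 expert is i and P_i is the mean router
      probability of expert i; it is smallest when usage is even.
      mean_entropy = average entropy of each sample's router distribution;
      it is smallest when each sample commits to one expert.
    """
    probs = [softmax(logits) for logits in batch_logits]
    n_samples = len(probs)
    n_experts = len(probs[0])

    counts = [0] * n_experts
    for p in probs:
        counts[p.index(max(p))] += 1
    f = [c / n_samples for c in counts]
    mean_p = [sum(p[i] for p in probs) / n_samples for i in range(n_experts)]

    load_balance = n_experts * sum(fi * pi for fi, pi in zip(f, mean_p))
    mean_entropy = sum(
        -sum(q * math.log(q) for q in p if q > 0.0) for p in probs
    ) / n_samples
    return load_balance, mean_entropy


def total_loss(diffusion_loss, batch_logits,
               lambda_balance=0.01, lambda_entropy=0.01):
    """Diffusion loss plus weighted auxiliary terms (weights are placeholders)."""
    load_balance, mean_entropy = router_aux_terms(batch_logits)
    return (diffusion_loss
            + lambda_balance * load_balance
            + lambda_entropy * mean_entropy)
```

Under these assumed forms, a router that sends every sample to the same expert is penalized by the load-balancing term, while a router that spreads probability evenly over experts for each sample is penalized by the entropy term. Together they push toward confident routing that still uses all experts.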

Abstract

Diffusion policies have emerged as a powerful framework for robotic visuomotor control, yet they often lack the robustness to recover from subtask failures in long-horizon, multi-stage tasks, and their learned representations of observations are often difficult to interpret. In this work, we propose the Mixture of Experts-Enhanced Diffusion Policy (MoE-DP), where the core idea is to insert a Mixture of Experts (MoE) layer between the visual encoder and the diffusion model. This layer decomposes the policy's knowledge into a set of specialized experts, which are dynamically activated to handle different phases of a task. We demonstrate through extensive experiments that MoE-DP exhibits a strong capability to recover from disturbances, significantly outperforming standard baselines in robustness. On a suite of 6 long-horizon simulation tasks, this leads to a 36% average relative improvement in success rate under disturbed conditions. This enhanced robustness is further validated in the real world, where MoE-DP also shows significant performance gains. We further show that MoE-DP learns an interpretable skill decomposition, where distinct experts correspond to semantic task primitives (e.g., approaching, grasping). This learned structure can be leveraged for inference-time control, allowing for the rearrangement of subtasks without any re-training. Our video and code are available at https://moe-dp-website.github.io/MoE-DP-Website/.

Paper Structure

This paper contains 19 sections, 9 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: MoE-DP enables robust recovery, interpretable skill decomposition, and high-level control for long-horizon manipulation. The baseline DP fails under disturbances such as object displacement. It lacks stage awareness, cannot recover from errors, overfits to subsequent trajectories, and cascades into further failures. MoE-DP learns an interpretable skill decomposition, with experts specializing in different skills, such as approaching, grasping, and placing. MoE-DP can detect failures and reactivate the correct expert to retry failed subtasks. Task order can be flexibly rearranged by controlling the sequence of expert activations without re-training, such as executing subtask 2 before subtask 1. Colored overlays indicate expert activations and the stage of subtasks.
  • Figure 2: Overview of MoE-DP with high-level guidance. In its autonomous mode, the system encodes observation inputs (images and robot state) into a feature vector, which is then fed to an MoE layer. The MoE's router automatically selects the appropriate expert for the current observation. The output of the selected expert then serves as a conditioning input for the Diffusion Policy during action generation. While the router typically operates autonomously, the architecture supports high-level control: an external agent, such as a human operator or a Vision-Language Model (VLM), can guide the policy by overriding the router's default selection. This capability enables flexible behaviors, such as reordering subtasks to generalize to novel sequences not seen during training.
  • Figure 3: Inference-time control via compositional skill decomposition. We demonstrate that MoE-DP learns modular and reusable skills that can be flexibly recombined to form novel behaviors without re-training. (Top) When executing a task in its demonstrated order, the policy decomposes the process into three distinct subtasks: Subtask 1 (picking the yellow duck), Subtask 2 (picking the strawberry), and Subtask 3 (moving the bowl). Each subtask invokes a consistent sequence of expert activations. (Bottom) At inference time, we manually command a novel sequence by altering the subtask order (Subtask 2 followed by Subtask 1). The policy successfully executes this new task by reordering the learned skill modules. Crucially, the expert activation pattern for an individual subtask (e.g., Subtask 1) remains consistent across both scenarios, indicating that MoE-DP learns compositional skills that enable generalization to new task structures.
  • Figure 4: Overview of the VLM-based planning and control framework. Our system leverages a VLM for high-level task planning in two stages. First, at the skill summarization (①) stage, the VLM builds a textual knowledge base of the robot's capabilities by analyzing annotated frames from a demonstration that follows the same execution sequence as the training data. Second, at the task execution stage (②), the VLM uses this knowledge, a high-level goal, and a real-time image to reason about the current task stage and predict the appropriate expert to activate. This hierarchical architecture enables the system to dynamically plan and rearrange the order of subtasks without any re-training, translating abstract goals into concrete robotic actions.
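The routing and override behavior described in the captions for Figures 2 to 4 can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: the router is assumed to be a linear layer with top-1 selection, each expert is a plain function on the feature vector, and `route`, `moe_condition`, `TOY_ROUTER`, and `TOY_EXPERTS` are hypothetical names. The diffusion model that would consume the conditioning vector is omitted.

```python
def route(features, router_weights, forced_expert=None):
    """Return the index of the expert to activate.

    In autonomous mode the expert with the highest router logit wins
    (assumed top-1 routing). If `forced_expert` is given, an external
    agent such as a human operator or a VLM overrides the router.
    """
    if forced_expert is not None:
        return forced_expert
    logits = [sum(w * x for w, x in zip(row, features))
              for row in router_weights]
    return logits.index(max(logits))


def moe_condition(features, router_weights, experts, forced_expert=None):
    """Return (expert_index, conditioning_vector) for one observation.

    The conditioning vector is what would be handed to the diffusion
    policy for action generation.
    """
    k = route(features, router_weights, forced_expert)
    return k, experts[k](features)


# Toy setup: two features, two experts (all values are made up).
TOY_ROUTER = [[1.0, 0.0],   # expert 0 responds to the first feature
              [0.0, 1.0]]   # expert 1 responds to the second feature
TOY_EXPERTS = [
    lambda f: [2.0 * x for x in f],  # stand-in for an "approach" skill
    lambda f: [-x for x in f],       # stand-in for a "grasp" skill
]

features = [0.9, 0.1]

# Autonomous mode: the router picks the expert itself.
auto_k, auto_cond = moe_condition(features, TOY_ROUTER, TOY_EXPERTS)

# Guided mode: an external planner forces expert 1 for the same observation.
forced_k, forced_cond = moe_condition(features, TOY_ROUTER, TOY_EXPERTS,
                                      forced_expert=1)
```

Reordering subtasks at inference time then amounts to supplying a different sequence of `forced_expert` values over the rollout, with the expert weights left unchanged.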