Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning

Moritz Reuss, Jyothish Pari, Pulkit Agrawal, Rudolf Lioutikov

TL;DR

This work tackles the high computational cost of diffusion-policy-based imitation learning by introducing MoDE, a Mixture-of-Denoising-Experts architecture that uses sparse routing conditioned on noise levels to activate specialized experts. By caching the expert combination selected for each denoising phase, MoDE uses up to 90% fewer FLOPs and runs inference faster while maintaining or improving task performance across 134 tasks in four multitask robotics benchmarks drawn from CALVIN and LIBERO. Pretraining on diverse robotic data further improves zero-shot generalization, yielding state-of-the-art results such as an average rollout length of 4.01 on CALVIN ABC→D. Comprehensive ablations confirm the importance of noise-conditioned routing and show how expert usage aligns with denoising stages, offering design insights for scalable diffusion-transformer architectures in multitask imitation learning.

Abstract

Diffusion Policies have become widely used in Imitation Learning, offering several appealing properties, such as generating multimodal and discontinuous behavior. As models grow larger to capture more complex capabilities, their computational demands increase, as shown by recent scaling laws. Continuing with current architectures will therefore present a computational roadblock. To address this gap, we propose Mixture-of-Denoising Experts (MoDE) as a novel policy for Imitation Learning. MoDE surpasses current state-of-the-art Transformer-based Diffusion Policies while enabling parameter-efficient scaling through sparse experts and noise-conditioned routing, reducing active parameters by 40% and, via expert caching, inference costs by 90%. Our architecture combines this efficient scaling with a noise-conditioned self-attention mechanism, enabling more effective denoising across different noise levels. MoDE achieves state-of-the-art performance on 134 tasks in four established imitation learning benchmarks (CALVIN and LIBERO). Notably, by pretraining MoDE on diverse robotics data, we achieve 4.01 on CALVIN ABC and 0.95 on LIBERO-90. MoDE surpasses both CNN-based and Transformer Diffusion Policies by an average of 57% across 4 benchmarks, while using 90% fewer FLOPs and fewer active parameters than default Diffusion Transformer architectures. Furthermore, we conduct comprehensive ablations on MoDE's components, providing insights for designing efficient and scalable Transformer architectures for Diffusion Policies. Code and demonstrations are available at https://mbreuss.github.io/MoDE_Diffusion_Policy/.
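
The noise-conditioned sparse routing mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the sinusoidal embedding of the log noise level, the linear router, the top-k renormalization, and the names `noise_embedding`, `route`, and `router_weights` are all assumptions made for illustration. The only property taken from the text is that the expert choice depends on the noise level alone.

```python
import math
import random


def noise_embedding(sigma, dim):
    """Sinusoidal embedding of the log noise level (assumed, not from the paper)."""
    half = dim // 2
    x = math.log(sigma)  # requires sigma > 0
    emb = []
    for i in range(half):
        freq = math.exp(-math.log(10000.0) * i / half)
        emb.append(math.sin(x * freq))
        emb.append(math.cos(x * freq))
    return emb


def route(sigma, router_weights, top_k):
    """Return [(expert_index, weight), ...] for the top_k experts at noise level sigma.

    router_weights is a hypothetical (num_experts x emb_dim) matrix of a linear router.
    The weights of the selected experts are a softmax over their logits, so they sum to 1.
    """
    emb = noise_embedding(sigma, len(router_weights[0]))
    logits = [sum(w * e for w, e in zip(row, emb)) for row in router_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Toy usage with random (untrained) router weights.
random.seed(0)
NUM_EXPERTS, EMB_DIM, TOP_K = 4, 8, 2
router_weights = [
    [random.gauss(0.0, 1.0) for _ in range(EMB_DIM)] for _ in range(NUM_EXPERTS)
]
for sigma in (80.0, 1.0, 0.01):
    print(sigma, route(sigma, router_weights, TOP_K))
```

Because `route` takes no token features as input, its output is a pure function of `sigma`. That is the property that makes it possible to compute the expert selection once per noise level instead of once per token.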

Paper Structure

This paper contains 33 sections, 9 equations, 12 figures, 9 tables.

Figures (12)

  • Figure 1: The proposed MoDE architecture (left) uses a transformer with causal masking, where each block includes noise-conditional self-attention and a noise-conditioned router that assigns tokens to expert models based on the noise level. This design enables efficient, scalable action generation. On the right, the router's activation of subsets of simple MLP experts with Swish-GLU activation during denoising is illustrated.
  • Figure 2: After training MoDE, the router is noise-conditioned, allowing pre-computation of the experts used at each noise level before inference. This enables removing the router and retaining only the selected experts, significantly improving network efficiency.
  • Figure 3: Visualization and results for the LIBERO environment. (a) A few example environments and tasks from the LIBERO-90 task suite. (b) Results for both LIBERO challenges, averaged over $3$ seeds with $20$ rollouts per task.
  • Figure 4: Overview of the CALVIN environment. (a) CALVIN contains four different settings (A, B, C, D) with different configurations of slides, drawers, and textures. (b) Example rollout consisting of $5$ tasks in sequence. The next goal is given to the policy only if it completes the previous one.
  • Figure 5: Computational efficiency comparison between MoDE and a dense Transformer model with the same number of parameters. Left: Average inference speed over 100 forward passes for various batch sizes. Right: FLOP count for MoDE with and without the router cache, compared against a dense baseline. MoDE demonstrates superior efficiency, with a lower FLOP count and faster inference thanks to its router caching and sparse expert design.
  • ...and 7 more figures
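
The expert caching described in the Figure 2 caption, combined with the Swish-GLU MLP experts from Figure 1, can be sketched as follows. This is an illustration under stated assumptions, not the released code: `toy_router` is a hypothetical stand-in for a trained noise-conditioned router, and the layer sizes, the noise schedule `sigmas`, and all function names are invented for the example.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def swish(v):
    """Swish / SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu_expert(x, w_gate, w_up, w_down):
    """Swish-GLU MLP: w_down @ (swish(w_gate @ x) * (w_up @ x))."""
    gate = [swish(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])


def moe_forward(x, routing, experts):
    """Weighted sum of the selected experts; routing is [(expert_index, weight), ...]."""
    out = [0.0] * len(x)
    for idx, weight in routing:
        y = swiglu_expert(x, *experts[idx])
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


def build_expert_cache(router, sigmas):
    """Run the router once per noise level of the sampling schedule."""
    return [router(sigma) for sigma in sigmas]


def cached_moe_forward(x, step, cache, experts):
    """Inference path: look up the precomputed routing, no router call needed."""
    return moe_forward(x, cache[step], experts)


def toy_router(sigma):
    """Hypothetical stand-in for a trained router: picks experts by noise band."""
    return [(0, 0.7), (1, 0.3)] if sigma > 1.0 else [(2, 0.6), (3, 0.4)]


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy usage: 4 experts, token width 4, hidden width 8 (all sizes are assumptions).
random.seed(0)
DIM, HIDDEN = 4, 8
experts = [
    (rand_matrix(HIDDEN, DIM), rand_matrix(HIDDEN, DIM), rand_matrix(DIM, HIDDEN))
    for _ in range(4)
]
sigmas = [80.0, 1.0, 0.01]  # assumed denoising schedule, high to low noise
cache = build_expert_cache(toy_router, sigmas)
x = [0.5, -1.0, 0.25, 2.0]
for step in range(len(sigmas)):
    print(step, cached_moe_forward(x, step, cache, experts))
```

Once `cache` is built, the router and any expert that never appears in it are no longer needed at inference time. This matches the caption's point that only the selected experts have to be retained.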