Staleness-Centric Optimizations for Parallel Diffusion MoE Inference

Jiajun Luo, Lizhuo Luo, Jianru Xu, Jiajun Song, Rongwei Lu, Chen Tang, Zhi Wang

TL;DR

Staleness in MoE diffusion model inference hinders performance under expert parallelism. DICE presents a cohesive framework with Interweaved Parallelism, Selective Synchronization, and Conditional Communication to reduce stale activations at the step, layer, and token levels, delivering up to 1.26× speedup with minimal quality loss on DiT-MoE variants. In extensive experiments on ImageNet 256×256, DICE outperforms Displaced Parallelism and DistriFusion in latency and memory efficiency while preserving image fidelity. The approach demonstrates practical scalability benefits for large MoE diffusion models and offers a path toward more efficient serving of diffusion-based generators.

Abstract

Mixture-of-Experts-based (MoE-based) diffusion models demonstrate remarkable scalability in high-fidelity image generation, yet their reliance on expert parallelism introduces critical communication bottlenecks. State-of-the-art methods alleviate such overhead in parallel diffusion inference through computation-communication overlapping, termed displaced parallelism. However, we identify that these techniques induce severe *staleness*: the use of outdated activations from previous timesteps, which significantly degrades quality, especially in expert-parallel scenarios. We tackle this fundamental tension and propose DICE, a staleness-centric optimization framework with a three-fold approach: (1) Interweaved Parallelism introduces staggered pipelines, effectively halving step-level staleness for free; (2) Selective Synchronization operates at the layer level and protects layers vulnerable to stale activations; and (3) Conditional Communication is a token-level, training-free method that dynamically adjusts communication frequency based on token importance. Together, these strategies effectively reduce staleness, achieving a 1.26× speedup with minimal quality degradation. Empirical results establish DICE as an effective and scalable solution. Our code is publicly available at https://github.com/Cobalt-27/DICE
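
To make the token-level idea in (3) concrete, here is a minimal sketch of one way communication frequency could depend on token importance. It is an illustration under stated assumptions, not the DICE implementation: the function name `select_tokens_to_send`, the score `threshold`, and the `low_priority_interval` are all hypothetical, and router scores are assumed to be plain floats in [0, 1]. Tokens that are not selected at a step would reuse their previously communicated (stale) results.

```python
def select_tokens_to_send(router_scores, step, threshold=0.5, low_priority_interval=2):
    """Return indices of tokens whose activations are communicated at `step`.

    Assumption (hypothetical policy): tokens with a router score at or above
    `threshold` are treated as important and communicated every step; the
    remaining tokens are communicated only every `low_priority_interval` steps
    and otherwise fall back to cached, stale results.
    """
    selected = []
    for idx, score in enumerate(router_scores):
        if score >= threshold:
            # Important token: always send fresh activations.
            selected.append(idx)
        elif step % low_priority_interval == 0:
            # Less important token: refresh only on every k-th step.
            selected.append(idx)
    return selected


scores = [0.9, 0.2, 0.6, 0.1]
sent_step0 = select_tokens_to_send(scores, step=0)  # all tokens refreshed
sent_step1 = select_tokens_to_send(scores, step=1)  # only high-score tokens
```

In this toy policy, the low-score tokens skip communication on odd steps, so the communicated volume drops while the tokens the router weights most heavily never become stale.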

Paper Structure

This paper contains 24 sections, 3 equations, 13 figures, 5 tables, 4 algorithms.

Figures (13)

  • Figure 1: Execution flows of (a) synchronous expert parallelism (no staleness); (b) the displaced variant of expert parallelism (two-step staleness, proposed in DistriFusion); and (c) our interweaved parallelism (one-step staleness). Each color denotes a different layer, showing how interweaving reduces staleness by staggering operations within the same step. The blue path highlights layer $N$, which spans two more steps in displaced parallelism but only one in interweaved parallelism. FID and speedup refer to interweaved parallelism alone (Section \ref{sec:tradeoff}).
  • Figure 2: Visualization of expert parallelism. Each device holds a subset of experts (Feed-Forward Networks, FFNs) and processes a portion of the input batch. Different colors denote distinct data samples.
  • Figure 3: Step-wise similarity heatmaps (cosine similarity between diffusion steps) in DiT-MoE, with both axes as diffusion steps. Results are shown for Layers 5 and 25, using one-hot routing assignments for similarity computation.
  • Figure 4: Overview of DICE. The blue path highlights layer $N$'s dataflow. Interweaved Parallelism interleaves operations; this execution pattern reduces the number of steps that cause staleness from two to one, half that of displaced expert parallelism. Selective Synchronization targets staleness-vulnerable deeper layers, and Conditional Communication prioritizes important tokens based on router scores.
  • Figure 5: Visual comparison of synchronization strategies in DiT-MoE-XL. Synchronizing only the deep layers provides the most effective optimization. FID shown for cfg = 1.5.
  • ...and 8 more figures
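
The step-wise similarity described in the Figure 3 caption can be sketched as follows. This is an illustrative reconstruction, not the authors' analysis code: it assumes top-1 routing (one expert per token), and the helper names `one_hot_routing`, `cosine_similarity`, and `stepwise_similarity` are hypothetical. Each step's routing is flattened into a single 0/1 vector, and the heatmap entry for two steps is the cosine similarity of their vectors.

```python
import math


def one_hot_routing(expert_ids, num_experts):
    """Flatten per-token expert choices into one 0/1 vector (top-1 routing assumed)."""
    vec = [0.0] * (len(expert_ids) * num_experts)
    for token, expert in enumerate(expert_ids):
        vec[token * num_experts + expert] = 1.0
    return vec


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def stepwise_similarity(routing_per_step, num_experts):
    """Build the step-by-step similarity matrix for one layer."""
    vecs = [one_hot_routing(r, num_experts) for r in routing_per_step]
    return [[cosine_similarity(a, b) for b in vecs] for a in vecs]


# Toy data: expert chosen by each of 4 tokens at 3 consecutive diffusion steps.
routing = [
    [0, 1, 2, 3],
    [0, 1, 2, 0],
    [1, 1, 2, 0],
]
sim = stepwise_similarity(routing, num_experts=4)
```

Under the top-1 assumption, each entry equals the fraction of tokens routed to the same expert at both steps, so values near 1 between adjacent steps indicate that routing changes slowly over the diffusion trajectory.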