Dense2MoE: Restructuring Diffusion Transformer to MoE for Efficient Text-to-Image Generation

Youwei Zheng, Yuxi Ren, Xin Xia, Xuefeng Xiao, Xiaohua Xie

TL;DR

Large diffusion transformers incur high inference costs, motivating a shift to structured sparsity. Dense2MoE converts a dense Diffusion Transformer (DiT) into a Mixture of Experts (MoE) and introduces Mixture of Blocks (MoB) to further reduce activated parameters without sacrificing capacity. The method replaces FFNs with MoE layers, groups blocks into MoB, and employs a three-stage distillation pipeline (Taylor-based initialization, load-balanced KD, and group-feature distillation) to recover performance. Experiments show substantial reductions in activated parameters (5.2B–2.6B, up to 62.5% compression) with performance on par with or better than pruning baselines, highlighting Dense2MoE as an effective route to efficient text-to-image diffusion.

Abstract

Diffusion Transformer (DiT) has demonstrated remarkable performance in text-to-image generation; however, its large parameter size results in substantial inference overhead. Existing parameter compression methods primarily focus on pruning, but aggressive pruning often leads to severe performance degradation due to reduced model capacity. To address this limitation, we pioneer the transformation of a dense DiT into a Mixture of Experts (MoE) for structured sparsification, reducing the number of activated parameters while preserving model capacity. Specifically, we replace the Feed-Forward Networks (FFNs) in DiT Blocks with MoE layers, reducing the number of activated parameters in the FFNs by 62.5%. Furthermore, we propose the Mixture of Blocks (MoB) to selectively activate DiT blocks, thereby further enhancing sparsity. To ensure an effective dense-to-MoE conversion, we design a multi-step distillation pipeline, incorporating Taylor metric-based expert initialization, knowledge distillation with load balancing, and group feature loss for MoB optimization. We transform large diffusion transformers (e.g., FLUX.1 [dev]) into an MoE structure, reducing activated parameters by 60% while maintaining original performance and surpassing pruning-based approaches in extensive experiments. Overall, Dense2MoE establishes a new paradigm for efficient text-to-image generation.
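
To make the FFN-to-MoE replacement concrete, the following is a minimal sketch of generic top-k expert routing for a single token, together with the arithmetic behind an activated-parameter reduction. It is not the authors' implementation: the function names, the softmax-then-renormalize gating, and the example of 8 equally sized experts with 3 active per token (one split that would yield 62.5%) are all illustrative assumptions that the abstract does not state.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts for one token.

    Returns (expert indices, gate weights renormalized to sum to 1).
    The renormalization step is an assumption of this sketch.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in chosen)
    return chosen, [probs[i] / total for i in chosen]


def moe_ffn(token, experts, router_logits, k):
    """Weighted sum of the outputs of the k selected experts only."""
    chosen, gates = top_k_route(router_logits, k)
    return sum(g * experts[i](token) for i, g in zip(chosen, gates))


def ffn_activated_reduction(num_experts, k):
    """Fraction of FFN parameters left inactive per token,
    assuming equally sized experts (hypothetical configuration)."""
    return 1.0 - k / num_experts


# Toy experts: each one just scales its scalar input (stand-ins for small FFNs).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
output = moe_ffn(1.0, experts, [0.1, 2.0, -1.0, 1.5], k=2)

print(ffn_activated_reduction(8, 3))  # 0.625
```

Only the selected experts are evaluated for each token, so total capacity (all experts) is retained while the per-token compute and activated parameters shrink.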

Paper Structure

This paper contains 11 sections, 10 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: The visual comparison between the 12B FLUX.1 [dev] and our FLUX.1-MoE models. The second, third, and fourth rows correspond to FLUX.1-MoE-L, FLUX.1-MoE-M, and FLUX.1-MoE-S, with 5.2B, 4B, and 3.2B activated parameters, respectively. These are sparse MoE models distilled from FLUX.1 [dev]. All images in each column are generated from the same random noise.
  • Figure 2: Comparison of activated parameters and performance between the FLUX.1-MoE models and the baseline models. The left panel shows the activated parameters, along with the parameter counts of the major modules in each model. The right panel compares our MoE models with the baselines on the GenEval benchmark (Ghosh et al., 2023).
  • Figure 3: Visualization of the MSE between the input and output of each single stream block in FLUX.1 [dev] under different prompts (left) and timesteps (right). The black line represents the average block MSE. The logarithm of the MSE is taken for better visualization.
  • Figure 4: The framework of Dense2MoE. The purple region represents the MoE layer, while the blue region denotes a MoB group. The pipeline comprises three stages: (a) Enhanced MoE Initialization, where MLP layers in DiT are restructured using a Taylor-based metric and knowledge distillation; (b) Dense-to-MoE Distillation, which assembles the enhanced weights into MoE layers and applies knowledge distillation with load balancing; (c) Group Feature Distillation for MoB, where blocks are grouped into MoB to further compress activated parameters by depth, with group features guiding the distillation.
  • Figure 5: Loss curves for different MoB group sizes during group feature distillation: 0–40K (left) and zoomed-in 20K–40K (right).
  • ...and 3 more figures
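
The per-block statistic described in the Figure 3 caption (log of the MSE between a block's input and output) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: hidden states are flattened to plain lists of floats, the logarithm base (10) is a guess because the caption does not specify it, and the function names are hypothetical.

```python
import math


def block_log_mse(block_input, block_output):
    """log10 of the mean squared difference between a block's input and output.

    Base 10 is an assumption; the caption only says "logarithm".
    """
    if not block_input or len(block_input) != len(block_output):
        raise ValueError("input and output must be non-empty and equally sized")
    mse = sum((o - i) ** 2 for i, o in zip(block_input, block_output)) / len(block_input)
    # An identity block has zero MSE, whose logarithm is minus infinity.
    return math.log10(mse) if mse > 0 else float("-inf")


def per_block_log_mse(hidden_states):
    """Given hidden states h_0, h_1, ..., h_n (h_0 is the input to the first
    block, h_j the output of block j), return one log-MSE value per block."""
    return [block_log_mse(a, b) for a, b in zip(hidden_states, hidden_states[1:])]


# Toy example: the second block changes its input far less than the first.
states = [[0.0, 0.0], [1.0, 1.0], [1.1, 1.1]]
scores = per_block_log_mse(states)
```

A block with a low score changes its input very little, which is presumably the kind of redundancy that motivates activating blocks selectively in MoB.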