E-MMDiT: Revisiting Multimodal Diffusion Transformer Design for Fast Image Synthesis under Limited Resources

Tong Shen, Jingai Yu, Dong Zhou, Dong Li, Emad Barsoum

TL;DR

E-MMDiT tackles the resource bottleneck of diffusion-based image synthesis with a lightweight multimodal diffusion transformer built around token reduction and efficient attention. It combines a highly compressive DC-AE tokenizer with a novel multi-path token compression module, Position Reinforcement, Alternating Subregion Attention, and AdaLN-affine to reduce computation while preserving spatial coherence. The model is trained from scratch on publicly available data using Rectified Flow with a representation alignment loss; the 512px model trains within 1.5 days on 8 AMD MI300X GPUs. It achieves competitive GenEval scores with substantially higher throughput than the compared models, and generates images at both 512px and 1024px. Together with the ablations, these design choices position E-MMDiT as a practical baseline for efficient diffusion-based generation and democratized access to high-quality image synthesis, with code released for reproducibility.

Abstract

Diffusion models have shown strong capabilities in generating high-quality images from text prompts. However, these models often require large-scale training data and significant computational resources to train, or suffer from heavy architectures with high latency. To this end, we propose Efficient Multimodal Diffusion Transformer (E-MMDiT), an efficient and lightweight multimodal diffusion model with only 304M parameters for fast image synthesis requiring low training resources. We provide an easily reproducible baseline with competitive results. Our model for 512px generation, trained on only 25M public samples in 1.5 days on a single node of 8 AMD MI300X GPUs, achieves 0.66 on GenEval and reaches 0.72 with post-training techniques such as GRPO. Our design philosophy centers on token reduction, since the computational cost scales significantly with the token count. We adopt a highly compressive visual tokenizer to produce a more compact representation and propose a novel multi-path compression module for further compression of tokens. To enhance our design, we introduce Position Reinforcement, which strengthens positional information to maintain spatial coherence, and Alternating Subregion Attention (ASA), which performs attention within subregions to further reduce computational cost. In addition, we propose AdaLN-affine, an efficient lightweight module for computing modulation parameters in transformer blocks. Our code is available at https://github.com/AMD-AGI/Nitro-E, and we hope E-MMDiT serves as a strong and practical baseline for future research and contributes to the democratization of generative AI models.
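
To make the token-reduction argument concrete, the following is a rough back-of-the-envelope sketch rather than anything taken from the paper's code. It uses the 32$\times$ DC-AE ratio given in the Figure 3 caption and assumes one token per latent position (patch size 1). The comparison baseline (an 8$\times$ tokenizer followed by 2$\times$2 patchification) is a hypothetical reference point chosen only for illustration, and the helper names are invented for this example.

```python
def num_tokens(resolution, downsample, patch_size=1):
    """Image tokens for a square input.

    Assumes the tokenizer shrinks each side by `downsample` and the
    transformer then groups `patch_size` x `patch_size` latents per token.
    """
    side = resolution // (downsample * patch_size)
    return side * side


def attention_pairs(n):
    """Query-key pairs in full self-attention over n tokens (grows as n^2)."""
    return n * n


for res in (512, 1024):
    compact = num_tokens(res, 32)                # 32x tokenizer (Figure 3), patch size 1 assumed
    baseline = num_tokens(res, 8, patch_size=2)  # hypothetical 8x tokenizer + 2x2 patches
    ratio = attention_pairs(baseline) // attention_pairs(compact)
    print(res, compact, baseline, ratio)
```

Under these assumptions a 512px image yields 256 tokens instead of 1024, and because full self-attention grows quadratically in the token count, a 4$\times$ reduction in tokens corresponds to a 16$\times$ reduction in attention pairs.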

Paper Structure

This paper contains 23 sections, 3 equations, 5 figures, 7 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison with other models on GenEval and throughput. Throughput is measured by generating 512px images using a batch size of 32 and 20 steps on an AMD MI300X GPU. Despite having only 304M parameters, our model achieves competitive GenEval performance and a clear advantage in throughput.
  • Figure 2: Images generated by our 304M E-MMDiT model at 512px (top) and 1024px (bottom).
  • Figure 3: Illustration of E-MMDiT. The image is encoded by DC-AE, a highly compressive tokenizer (32$\times$ ratio), and the prompt is encoded by a lightweight LLM, Llama3.2-1B. The E-MMDiT blocks incorporate ASA for faster token interaction. After the first $N_1$ blocks, the tokens are further condensed by our multi-path compression module, with ratios of 2$\times$ and 4$\times$, for the following $N_2$ blocks. The tokens are then recovered by the token reconstructor and processed by the final $N_3$ blocks. A positional embedding is also added to the reconstructed tokens for position reinforcement. AdaLN-affine encodes the timestep and provides modulation parameters for each block through an affine transformation of the global vector.
  • Figure 4: Illustration of ASA with two consecutive blocks. Tokens are shown as one-dimensional sequences for simplicity. The left side depicts the downsampled attention in UDiT, where tokens are always divided using the same grouping pattern and therefore lack inter-group communication. Extra depthwise convolutions are required in the FFN. In contrast, our proposed ASA, shown on the right, simply alternates the grouping pattern in the second block. Tokens that share a color (group) in the first block are reorganized into groups containing tokens of different colors, which enables interaction across subregions.
  • Figure 5: Visual comparison between the distilled and the full-step models. The 4-step results maintain the same visual quality as the original 20-step results.
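
The alternating-grouping idea described in the Figure 4 caption can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the authors' implementation: the two grouping patterns used here (contiguous chunks, then strided groups) are hypothetical stand-ins, since the caption does not specify the exact patterns, and attention is abstracted to "every token in a group receives information from every other token in that group". All function names are invented for this example.

```python
def contiguous_groups(num_tokens, num_groups):
    """Hypothetical pattern A: split a 1D token sequence into contiguous chunks."""
    size = num_tokens // num_groups
    return [list(range(g * size, (g + 1) * size)) for g in range(num_groups)]


def strided_groups(num_tokens, num_groups):
    """Hypothetical pattern B: group tokens that share the same index modulo num_groups."""
    return [list(range(g, num_tokens, num_groups)) for g in range(num_groups)]


def reachable_after(patterns, num_tokens):
    """Track which source tokens can have influenced each token.

    Each entry of `patterns` stands for one block whose attention is
    restricted to the given groups.
    """
    reach = [{i} for i in range(num_tokens)]
    for groups in patterns:
        new_reach = [set() for _ in range(num_tokens)]
        for group in groups:
            # Attention inside a group mixes everything its members have seen so far.
            merged = set().union(*(reach[i] for i in group))
            for i in group:
                new_reach[i] = set(merged)
        reach = new_reach
    return reach


num_tokens, num_groups = 16, 4
a = contiguous_groups(num_tokens, num_groups)
b = strided_groups(num_tokens, num_groups)

fixed = reachable_after([a, a], num_tokens)        # same pattern in both blocks
alternating = reachable_after([a, b], num_tokens)  # pattern changes in the second block

print(len(fixed[0]), len(alternating[0]))  # prints "4 16"
```

In this toy setting, repeating the same pattern leaves each token confined to its own group of 4, while alternating the pattern lets every token reach all 16 tokens after two blocks. The per-block attention cost is the same in both cases, since each block still attends only within groups of 4.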