DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation

Chen Chen, Rui Qian, Wenze Hu, Tsu-Jui Fu, Jialing Tong, Xinze Wang, Lezhi Li, Bowen Zhang, Alex Schwing, Wei Liu, Yinfei Yang

TL;DR

This paper systematically evaluates Diffusion Transformer architectures for text-to-image generation, showing that a streamlined DiT with concatenated text and noise inputs, coupled with shared AdaLN, matches or surpasses more complex conditioning schemes at scale. It introduces DiT-Air and DiT-Air-Lite to maximize parameter efficiency, achieving substantial reductions in model size while maintaining high fidelity and alignment across benchmarks like GenEval and T2I CompBench. Through careful ablations of text encoders and VAEs, and a progressive multi-stage training pipeline including supervised and reward fine-tuning, the work delivers state-of-the-art performance with practical efficiency advantages. The findings offer actionable guidance for designing efficient, high-quality text-to-image diffusion systems suitable for large-scale deployment and real-world applications.

Abstract

In this work, we empirically study Diffusion Transformers (DiTs) for text-to-image generation, focusing on architectural choices, text-conditioning strategies, and training protocols. We evaluate a range of DiT-based architectures, including PixArt-style and MMDiT variants, and compare them with a standard DiT variant that directly processes concatenated text and noise inputs. Surprisingly, our findings reveal that the performance of the standard DiT is comparable to that of the specialized models, while it demonstrates superior parameter efficiency, especially when scaled up. Leveraging a layer-wise parameter sharing strategy, we achieve a further reduction of 66% in model size compared to an MMDiT architecture, with minimal performance impact. Building on an in-depth analysis of critical components such as text encoders and Variational Auto-Encoders (VAEs), we introduce DiT-Air and DiT-Air-Lite. With supervised and reward fine-tuning, DiT-Air achieves state-of-the-art performance on GenEval and T2I CompBench, while DiT-Air-Lite remains highly competitive, surpassing most existing models despite its compact size.
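
As a rough illustration of the idea summarized above (a single token sequence built by concatenating text and noisy-latent tokens, with one AdaLN modulation shared across layers), the following NumPy sketch may help. It is a toy under stated assumptions, not the paper's implementation: the dimensions, the single-head attention, the omission of the MLP sub-block, the shift/scale/gate form of AdaLN, and every name (`w_adaln`, `t_embedding`, and so on) are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def layer_norm(x, eps=1e-6):
    """Normalize each token over its feature dimension (no learned affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def self_attention(x, w_qkv, w_out):
    """Single-head self-attention over the whole (text + image) sequence."""
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights @ v) @ w_out


# Hypothetical toy sizes: hidden width, token counts, depth.
d, n_text, n_image, n_layers = 16, 4, 9, 3

text_tokens = rng.normal(size=(n_text, d))          # stand-in for text embeddings c
noisy_latent_tokens = rng.normal(size=(n_image, d))  # stand-in for noisy latent z_t
t_embedding = rng.normal(size=(d,))                  # stand-in for timestep embedding

# Assumption: ONE AdaLN projection is created once and reused by every layer,
# instead of each layer owning its own modulation weights.
w_adaln = rng.normal(size=(d, 3 * d)) * 0.02
shift, scale, gate = np.split(t_embedding @ w_adaln, 3)

# Concatenate text and image tokens into one sequence; all tokens then share
# the same attention weights inside each layer.
x = np.concatenate([text_tokens, noisy_latent_tokens], axis=0)

for _ in range(n_layers):
    # Per-layer attention weights (not shared in this sketch).
    w_qkv = rng.normal(size=(d, 3 * d)) * 0.02
    w_out = rng.normal(size=(d, d)) * 0.02
    h = layer_norm(x) * (1.0 + scale) + shift   # shared AdaLN modulation
    x = x + gate * self_attention(h, w_qkv, w_out)

# Only the image positions would be decoded into the prediction target.
image_out = x[n_text:]
print(x.shape, image_out.shape)
```

In this sketch the modulation weights are counted once regardless of depth, which is the sense in which sharing them saves parameters as the number of layers grows.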

Paper Structure

This paper contains 59 sections, 2 equations, 8 figures, 14 tables.

Figures (8)

  • Figure 1: Comparison of text-to-image generation methods on two metrics, GenEval and T2I CompBench (higher is better for both). Despite significantly smaller model size, our proposed DiT-Air achieves state-of-the-art results. Note that, for our model, we report the full model size including text encoder and VAE. A detailed parameter breakdown is provided in Appendix \ref{sec:appendix_sota_model_size_breakdown}.
  • Figure 2: Sample images from our proposed DiT-Air, each with the text prompt below it. See Appendix \ref{sec:appendix_samples} for more examples.
  • Figure 3: Overview of Latent Diffusion Training. During training, an image $\mathbf{x}$ is encoded into a latent $\mathbf{z}_0$ via a VAE, and the text prompt $p$ is mapped to embeddings $\mathbf{c}$. A forward diffusion process adds noise to $\mathbf{z}_0$, and the model learns to reverse this process by predicting the noise (or a similar target) at each timestep.
  • Figure 4: Comparison of Diffusion Transformer Architectures. Element-wise operations are denoted by $\bullet$, and sequence-wise operations by $\circ$. The details of inputs $\mathbf{c}$, $\mathbf{z}$, $t$ can be found in Figure \ref{fig:latent_diffusion}. PixArt-$\alpha$ relies on sequential self- and cross-attention, whereas MMDiT uses a dual-stream approach with separate parameters for text and image tokens. Our proposed DiT-Air resembles a vanilla DiT that processes concatenated text and noise tokens.
  • Figure 5: Validation Loss vs. Model Size for PixArt-$\alpha$, MMDiT, and DiT-Air. The plot illustrates the scaling behavior of three architectures across model sizes ranging from S to XXL, where the model size refers only to the diffusion transformer component (excluding the text encoder and VAE). The x-axis is in logarithmic scale, and the fitted lines depict the scaling trend using the formula $L = a \cdot S^b$. Among the three, DiT-Air achieves the best parameter efficiency.
  • ...and 3 more figures
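
The Figure 5 caption describes fitting the scaling trend with $L = a \cdot S^b$. One common way to fit such a power law, sketched below, is ordinary least squares in log-log space, since $\log L = \log a + b \log S$ is linear. This is only an assumed fitting procedure (the caption does not say how the fit was done), and the model sizes and losses in the example are synthetic placeholders, not values from the paper.

```python
import numpy as np


def fit_power_law(sizes, losses):
    """Fit L = a * S**b by linear regression on (log S, log L).

    Returns (a, b). Assumes all sizes and losses are strictly positive.
    """
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
    return float(np.exp(intercept)), float(slope)


# Synthetic, hypothetical data generated from a known power law so that the
# fit can be checked; these are NOT measurements from the paper.
sizes = np.array([1e8, 3e8, 1e9, 3e9])   # parameter counts (placeholder)
true_a, true_b = 2.0, -0.05
losses = true_a * sizes ** true_b

a, b = fit_power_law(sizes, losses)
print(f"a = {a:.4f}, b = {b:.4f}")
```

With a fit of this form, a more negative exponent $b$ means the loss falls faster as the model grows, and at a fixed size the curve with the lower predicted loss is the more parameter-efficient one.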